<a href="https://colab.research.google.com/github/etuckerman/SOCOTEC/blob/main/SOCOTEC_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import torch

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU found!")


GPU: NVIDIA A100-SXM4-40GB


In [2]:
import torch

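# Make float16 the default dtype for newly created floating-point tensors
# (note: this is not autocast-style mixed precision)
torch.set_default_dtype(torch.float16)
# Allow TF32 tensor cores for float32 matrix multiplications on Ampere GPUs such as the A100
torch.backends.cuda.matmul.allow_tf32 = True
# Let cuDNN benchmark and select the fastest convolution algorithms
torch.backends.cudnn.benchmark = True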
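
Setting float16 as the default dtype trades numeric precision for speed and memory. As a rough illustration of what half precision can and cannot represent, the sketch below round-trips Python floats through the IEEE 754 half-precision format using only the standard library. The helper name `to_float16` is hypothetical and is not part of the notebook.

```python
import struct


def to_float16(value: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision (struct format 'e')."""
    return struct.unpack("e", struct.pack("e", value))[0]


# 0.1 has no exact half-precision representation, so a small rounding error appears.
print(to_float16(0.1))

# Half precision has an 11-bit significand, so not every integer above 2048 is representable.
print(to_float16(2049.0))
```

If the rounding error matters for a given step, keeping that step in float32 is the usual alternative.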


In [3]:
%%capture
!pip install llama_parse huggingface_hub langchain chromadb nest_asyncio langchain-community unstructured langchain-huggingface


In [4]:
!nvidia-smi


Tue Jan  7 17:18:23 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P0              43W / 400W |      5MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

# RAG PIPELINE

# Loading and Preprocessing

In [5]:
import nest_asyncio
from llama_parse import LlamaParse

# Apply nest_asyncio to handle the event loop
nest_asyncio.apply()

### BASIC PARSING
# # Initialize the LlamaParse parser with optimized parsing instructions
# parser = LlamaParse(
#     api_key="llx-...",  # key redacted; never commit real API keys to a shared notebook
#     result_type="markdown",  # Retain markdown format for structured output
#     language="en",  # Set to English since the IBC is in English
#     verbose=True,  # Enable detailed logs to monitor parsing performance
#     is_formatting_instruction=True,  # Preserve formatting for context retrieval
#     parsing_instruction="""
#         Extract the following key elements from the document:
#         1. Chapter titles and their numbers.
#         2. Section headings and subheadings with their corresponding numbers.
#         3. Key definitions and terms listed in the document.
#         4. Detailed descriptions of occupancy classifications, fire-resistance requirements, and structural design criteria.
#         5. All tables and their captions, including their associated data.
#         6. Any reference codes, figures, or diagrams mentioned in the text.
#         Format the extracted data in a structured and readable manner, preserving markdown styling for clarity (e.g., **bold** headings, bullet points for lists, etc.).
#     """
# )

### OPTIMISED PARSING TEST [currently costs $30, so I cancelled it]
# Initialize the LlamaParse parser with optimized parameters
parser = LlamaParse(
    # API key redacted; LlamaParse falls back to the LLAMA_CLOUD_API_KEY environment variable
    is_remote=False,  # Processing locally for faster iterations
    verbose=True,  # Keep verbose for detailed logs
    show_progress=True,  # Show progress for better tracking
    language="en",  # Document language is English
    split_by_page=True,  # Process document page by page for modularity
    result_type="markdown",  # Export as markdown for better structuring
    max_timeout=3000,  # Increase timeout for processing large documents
    num_workers=6,  # Utilize 6 workers for concurrent processing
    parsing_instruction=(
        "Extract all critical information, including definitions, tables, figures, and important text "
        "relevant to occupancy classifications, construction types, fire-resistance requirements, "
        "design loads, and any other regulations. Focus on sections that may aid in answering queries."
    ),
    structured_output=False,  # Output as plain markdown, structured parsing is unnecessary here
    annotate_links=True,  # Annotate links for better context during retrieval
    auto_mode=True,  # Enable auto mode to trigger optimizations for certain elements
    auto_mode_trigger_on_table_in_page=True,  # Prioritize tables (highly structured info)
    auto_mode_trigger_on_image_in_page=True,  # Include charts/diagrams for completeness
    disable_ocr=False,  # Allow OCR for text in non-standard formats
    extract_charts=True,  # Include chart data in the parsed output
    extract_layout=False,  # Skip layout info, focusing purely on content
    premium_mode=True,  # Enable premium processing for improved accuracy
    page_separator="\n\n---\n\n",  # Separate pages clearly for retrieval
    max_pages=None,  # Process the entire document
    continuous_mode=False,  # Avoid continuous mode; keep pages distinct
)


# Parse the IBC document
parsed_documents = parser.load_data("/content/IBC.pdf")

# Save the parsed results as a single markdown file
with open('IBC.md', 'w', encoding='utf-8') as f:
    for doc in parsed_documents:
        f.write(doc.text + '\n')


Started parsing the file under job_id 09ab8e9f-7e24-47a3-a891-9971481e4ae3
.....

KeyboardInterrupt: 

# Embedding and Vector Store setup

When processing a document as long as the IBC for a Retrieval-Augmented Generation (RAG) system, the chunking and embedding settings need to balance retrieval accuracy against speed and memory use.

Optimizing Text Chunking and Embedding:

Text Chunking:

- Chunk Size: `chunk_size` is set to 1500 characters, so that each chunk carries enough context to be understood on its own while several retrieved chunks still fit into one prompt.
- Overlap: `chunk_overlap` is set to 100 characters, so that text cut at a chunk boundary also appears at the start of the next chunk. This helps preserve references that span sections.

Embeddings:

- Model Selection: all-MiniLM-L6-v2 is a small sentence-embedding model that produces 384-dimensional vectors, which keeps embedding fast. It truncates input beyond 256 word pieces by default, so the end of a 1500-character chunk may not be reflected in its embedding.
- Vector Store: Chroma is used as the vector store. It holds the chunk embeddings and runs the similarity search over them.
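
To make the chunk size and overlap arithmetic concrete, here is a minimal sketch of fixed-size character windows with overlap. It is a simplification: `RecursiveCharacterTextSplitter` first splits on separators such as paragraph and line breaks and only then merges pieces up to `chunk_size`, so its real boundaries differ. The function name `chunk_text` and the sample string are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1500, chunk_overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each window starts (chunk_size - chunk_overlap) characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Hypothetical 4000-character sample made of repeating letters.
sample = "".join(chr(97 + i % 26) for i in range(4000))
chunks = chunk_text(sample)
print([len(c) for c in chunks])

# The last 100 characters of one chunk reappear at the start of the next.
print(chunks[0][-100:] == chunks[1][:100])
```

With these settings each new chunk advances by 1400 characters, so a document of N characters yields roughly N / 1400 chunks.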

In [5]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import UnstructuredMarkdownLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter


# Load the parsed markdown document
loader = UnstructuredMarkdownLoader("IBC.md")
docs = loader.load()

# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=100)
texts = text_splitter.split_documents(docs)

# Create embeddings and vector store
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(texts, embeddings)
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 2})


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


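
The retriever above returns the `k` chunks whose embeddings are closest to the query embedding. The sketch below shows that idea with plain Python lists and cosine similarity. It is illustrative only: the vectors and document ids are made up, and Chroma's actual distance metric and index may differ from what is shown here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k vectors most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical 3-dimensional embeddings; real ones from all-MiniLM-L6-v2 have 384 dimensions.
doc_vecs = {
    "appendix_b": [0.9, 0.1, 0.0],
    "chapter_3": [0.1, 0.9, 0.1],
    "chapter_16": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]
print(top_k(query_vec, doc_vecs, k=2))
```

With `k=2`, only two chunks reach the prompt, so a question whose answer is spread over more sections may need a larger `k`.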

# MODEL SETUP

In [6]:
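# Step 3: Load the Qwen Model
from transformers import pipeline
from langchain_huggingface import HuggingFacePipeline

# Note: "Qwen/Qwen2.5-7B" is the base checkpoint, not the instruction-tuned variant.
# By default the text-generation pipeline returns the prompt followed by the
# completion (return_full_text=True), so the chain results start with the prompt text.
qwen_pipe = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-7B",
    tokenizer="Qwen/Qwen2.5-7B",
    device=0  # Use the first GPU (cuda:0)
)
qwen_llm = HuggingFacePipeline(pipeline=qwen_pipe)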


Device set to use cuda:0


## Refine Prompt Template

In [12]:
from langchain.prompts import PromptTemplate

prompt = PromptTemplate(
    input_variables=["context", "query"],
    # Adjacent string literals are concatenated as-is, so each sentence ends with a space.
    template=(
        "You are Qwen, created by Alibaba Cloud. You are a helpful assistant. "
        "You have the knowledge of the IBC 2018 International Building Code book. "
        "You use this knowledge to answer users' queries; don't reference the document in the third person, just speak as if you know the information. "
        "Given the following context, provide a concise answer to the query:\n\n"
        "{context}\n\n"
        "Query: {query}\n"
        "Response:"
    ),
)
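
As a rough illustration of how such a template is filled, the sketch below uses plain `str.format`, which mirrors the f-string style placeholders of a prompt template. The shortened template text and the helper name `build_prompt` are hypothetical; the context sentence is taken from the retrieved text shown in this notebook's output.

```python
# Adjacent string literals are joined with nothing in between,
# so each sentence needs its own trailing space.
template = (
    "You are a helpful assistant. "
    "Given the following context, provide a concise answer to the query:\n\n"
    "{context}\n\n"
    "Query: {query}\n"
    "Response:"
)


def build_prompt(context: str, query: str) -> str:
    """Fill the two placeholders of the template."""
    return template.format(context=context, query=query)


example = build_prompt(
    context="Chapter 2 contains definitions of terms used throughout the code.",
    query="What does Chapter 2 contain?",
)
print(example)
```

Ending the template with `Response:` leaves the model to continue from that point, which is why the retrieved context is placed before the query.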

## Set Up RetrievalQA Chain

In [13]:
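from langchain.chains import RetrievalQA

# Note: the custom `prompt` defined above is not passed here, so the chain falls back
# to LangChain's default question-answering prompt (visible in the outputs below).
# RetrievalQA fills the variables `context` and `question`, so the template's
# `query` placeholder would need renaming before passing `prompt=prompt`.
qa_chain = RetrievalQA.from_llm(llm=qwen_llm, retriever=retriever)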

# Step 6: Test the RAG System
query_1 = "What is the purpose of Appendix B: Board of Appeals?"
response_1 = qa_chain.invoke({"query": query_1})
print(f"Answer 1: {response_1}")


Answer 1: {'query': 'What is the purpose of Appendix B: Board of Appeals?', 'result': "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\nContext:\nAppendix A Employee Qualifications. Effective administration and enforcement of the family of International Codes depends on the training and expertise of the personnel employed by the jurisdiction and his or her knowledge of the codes. Section 103 of the code establishes the Depart- ment of Building Safety and calls for the appointment of a building official and deputies such as plans examiners and inspectors. Appendix A provides standards for experience, training and certifi- cation for the building official and the other staff mentioned in Chapter 1.\n\nAppendix B Board of Appeals. Section 113 of Chapter 1 requires the establishment of a board of appeals to hear appeals regarding determinations made by the building official.

In [11]:

query_2 = "Explain the key concepts discussed in the document?"
response_2 = qa_chain.invoke({"query": query_2})
print(f"Answer 2: {response_2}")


Answer 2: {'query': 'Explain the key concepts discussed in the document?', 'result': "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\nContext:\nChapter 29 of IBC correlates with Chapters 3 & 4 of IPC for plumbing fixtures and facilities\n\nThe image also provides brief descriptions of Chapters 1 and 2 of the IBC:\n\nChapter 1 establishes the scope, applicability, and administration of the code.\n\nChapter 2 contains definitions of terms used throughout the code.\n\nThe document emphasizes the importance of every word, term, and punctuation mark in the code, as they can impact the meaning and intended results of the code provisions.\n\nmeaning in the code and the code meaning can differ substantially from the ordinarily understood meaning of the term as used outside of the code. Where understanding of a term's definition is especially key to or necessary for understandin
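
Both results above begin with the prompt itself, because the generation pipeline returns the prompt followed by the completion. One way to keep only the completion is to cut the text at the final answer marker, as in this minimal sketch. The marker `Helpful Answer:` is an assumption about how the default prompt ends (the outputs above are truncated before that point), and `extract_answer` and the placeholder strings are hypothetical.

```python
def extract_answer(full_text: str, marker: str = "Helpful Answer:") -> str:
    """Return the text after the last occurrence of marker, or the whole text if it is absent."""
    _, found, tail = full_text.rpartition(marker)
    return tail.strip() if found else full_text.strip()


# Placeholder text standing in for a generated result; not real model output.
generated = (
    "Use the following pieces of context to answer the question at the end.\n\n"
    "Context:\n<retrieved chunk>\n\n"
    "Question: What is the purpose of Appendix B: Board of Appeals?\n"
    "Helpful Answer: <model completion>"
)
print(extract_answer(generated))
```

An alternative that avoids post-processing is to pass `return_full_text=False` when building the `transformers` text-generation pipeline, so that only the newly generated text is returned.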

In [None]:
# Example IBC-specific questions
queries = [
    "What is the purpose of Appendix B: Board of Appeals?",
    "What are the occupancy classifications defined in Chapter 3?",
    "How does the IBC define mixed-use occupancies?",
    "What are the fire-resistance requirements for Type I construction?",
    "What are the minimum design loads for buildings and structures?"
]


In [None]:
# Loop through the queries and print only the generated text of each response
for query in queries:
    response = qa_chain.invoke({"query": query})
    print(f"Query: {query}\nAnswer: {response['result']}\n")
