<a href="https://colab.research.google.com/github/etuckerman/SOCOTEC/blob/main/SOCOTEC_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import torch

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU found!")


GPU: NVIDIA A100-SXM4-40GB


In [2]:
import torch

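# Make float16 the default dtype for new floating-point tensors, allow TF32
# matmuls on the A100, and let cuDNN benchmark convolution algorithms.
# Note: this changes the default dtype; it is not autocast-style mixed precision.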
torch.set_default_dtype(torch.float16)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True


In [3]:
!nvidia-smi


Tue Jan  7 00:27:58 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0              42W / 400W |      5MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

# RAG PIPELINE

In [11]:
!pip install llama_parse huggingface_hub langchain chromadb nest_asyncio langchain-community unstructured langchain-huggingface


Collecting langchain-huggingface
  Downloading langchain_huggingface-0.1.2-py3-none-any.whl.metadata (1.3 kB)
Downloading langchain_huggingface-0.1.2-py3-none-any.whl (21 kB)
Installing collected packages: langchain-huggingface
Successfully installed langchain-huggingface-0.1.2


# Loading and Preprocessing

In [None]:
import os

import nest_asyncio
from llama_parse import LlamaParse

# Apply nest_asyncio so LlamaParse can run inside the notebook's existing event loop
nest_asyncio.apply()

# Initialize the LlamaParse parser with parsing instructions tailored to the IBC
parser = LlamaParse(
    # Read the key from the environment instead of hardcoding it in the notebook
    api_key=os.environ["LLAMA_CLOUD_API_KEY"],
    result_type="markdown",  # Retain markdown format for structured output
    language="en",  # Set to English since the IBC is in English
    verbose=True,  # Enable detailed logs to monitor parsing performance
    is_formatting_instruction=True,  # Preserve formatting for context retrieval
    parsing_instruction="""
        Extract the following key elements from the document:
        1. Chapter titles and their numbers.
        2. Section headings and subheadings with their corresponding numbers.
        3. Key definitions and terms listed in the document.
        4. Detailed descriptions of occupancy classifications, fire-resistance requirements, and structural design criteria.
        5. All tables and their captions, including their associated data.
        6. Any reference codes, figures, or diagrams mentioned in the text.
        Format the extracted data in a structured and readable manner, preserving markdown styling for clarity (e.g., **bold** headings, bullet points for lists, etc.).
    """
)


# Parse the IBC PDF
parsed_documents = parser.load_data("/content/IBC.pdf")

# Save the parsed results as a single markdown file
with open("IBC.md", "w", encoding="utf-8") as f:
    for doc in parsed_documents:
        f.write(doc.text + "\n")


# Embedding and Vector Store Setup

When processing a document as large as the IBC for a Retrieval-Augmented Generation (RAG) system, the chunking and embedding settings need to balance retrieval accuracy against compute cost.

Optimizing Text Chunking and Embedding:

Text Chunking:

- Chunk Size: `chunk_size` is set to 1500 characters. Each chunk is large enough to keep a passage together with some surrounding context, and small enough to fit comfortably in the context window of most language models.
- Overlap: `chunk_overlap` is set to 100 characters, so text that straddles a chunk boundary appears in both neighbouring chunks. This helps preserve references that cross section boundaries.

Embeddings:

- Model Selection: all-MiniLM-L6-v2 is a small sentence-embedding model that produces 384-dimensional vectors, a reasonable trade-off between retrieval quality and compute cost.
- Vector Store: Chroma stores the chunk embeddings and serves similarity searches over them.
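
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification, not what `RecursiveCharacterTextSplitter` does internally: that splitter prefers to break on separators such as paragraph breaks and treats `chunk_size` as an upper bound. The `chunk_text` helper and the sample string are hypothetical and exist only for illustration.

```python
def chunk_text(text: str, chunk_size: int = 1500, chunk_overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Hypothetical stand-in for the parsed markdown (4000 characters)
sample = "".join(str(i % 10) for i in range(4000))
chunks = chunk_text(sample)
print([len(c) for c in chunks])          # [1500, 1500, 1200]
print(chunks[0][-100:] == chunks[1][:100])  # True: neighbours share 100 characters
```

With these settings each new chunk starts 1400 characters after the previous one, so the last 100 characters of one chunk are repeated at the start of the next.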

In [14]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import UnstructuredMarkdownLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter


# Load the parsed markdown document
loader = UnstructuredMarkdownLoader("IBC.md")
docs = loader.load()

# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=100)
texts = text_splitter.split_documents(docs)

# Create embeddings and vector store
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(texts, embeddings)
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 2})
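
The retriever above returns the `k` chunks whose embeddings are closest to the query embedding. The sketch below illustrates that idea in plain Python with cosine similarity and toy 2-dimensional vectors. It is only an illustration: the metric and index that Chroma actually uses depend on the collection settings, and the `cosine_similarity` and `top_k` helpers and the vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]


# Toy embeddings standing in for chunk vectors (real ones have 384 dimensions)
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query_vec = [1.0, 0.1]
print(top_k(query_vec, doc_vecs, k=2))  # [0, 2]
```

With `k=2`, only two chunks reach the language model per question, which keeps the prompt short but may miss relevant passages for broad questions.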


# MODEL SETUP

In [18]:
# Step 3: Load the Qwen Model
from transformers import pipeline
from langchain_huggingface import HuggingFacePipeline

qwen_pipe = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-7B",
    tokenizer="Qwen/Qwen2.5-7B",
    device=0  # Use GPU
)
qwen_llm = HuggingFacePipeline(pipeline=qwen_pipe)

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Device set to use cuda:0


OutOfMemoryError: CUDA out of memory. Tried to allocate 260.00 MiB. GPU 0 has a total capacity of 39.56 GiB of which 200.81 MiB is free. Process 8206 has 39.36 GiB memory in use. Of the allocated memory 38.68 GiB is allocated by PyTorch, and 191.86 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
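
The error above reports that the process already holds 39.36 GiB, so the precision of the loaded weights is a likely factor, possibly together with memory left over from earlier load attempts in the same runtime. As a rough sanity check, the sketch below estimates the memory needed for the weights alone at different precisions. The parameter count is an assumed round figure, not the exact size of Qwen2.5-7B, and the estimate ignores activations, the KV cache, and CUDA context overhead.

```python
# Bytes used per parameter for common dtypes
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Approximate memory for model weights only, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


num_params = 7e9  # assumption: a round 7 billion parameters
for dtype in ("float32", "float16", "int8"):
    print(f"{dtype}: {weight_memory_gib(num_params, dtype):.1f} GiB")
```

If the estimate for float32 is close to the card's capacity, one option to try is restarting the runtime and passing `torch_dtype=torch.float16` to `pipeline(...)` so the weights are loaded in half precision.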

## Refine Prompt Template

In [15]:
from langchain.prompts import PromptTemplate

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "Given the following context, provide a concise answer to the question:\n\n"
        "{context}\n\n"
        "Question: {question}\n"
        "Answer:"
    ),
)

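
To see what the model receives, the template can be filled by hand. The sketch below uses plain `str.format`, which for simple `{placeholder}` fields should behave like `PromptTemplate`'s default formatting. The retrieved chunks are hypothetical placeholders; in the pipeline they would come from the retriever.

```python
template = (
    "Given the following context, provide a concise answer to the question:\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)

# Hypothetical retrieved chunks, joined into a single context string
retrieved_chunks = ["Chunk one text.", "Chunk two text."]
filled = template.format(
    context="\n\n".join(retrieved_chunks),
    question="What is the purpose of Appendix B: Board of Appeals?",
)
print(filled)
```

Ending the prompt with `Answer:` matters for a base model such as Qwen2.5-7B, which continues the text rather than following chat-style instructions.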

## Set Up the RetrievalQA Chain

In [None]:
from langchain.chains import RetrievalQA

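# Pass the custom prompt defined above; without it the chain uses its default prompt
qa_chain = RetrievalQA.from_llm(llm=qwen_llm, retriever=retriever, prompt=prompt)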

# Step 6: Test the RAG System
question_1 = "What is the purpose of Appendix B: Board of Appeals?"
# RetrievalQA expects the question under "query" and returns the answer under "result"
response_1 = qa_chain.invoke({"query": question_1})["result"]
print(f"Answer 1: {response_1}")

question_2 = "What are the key concepts discussed in the document?"
response_2 = qa_chain.invoke({"query": question_2})["result"]
print(f"Answer 2: {response_2}")

KeyboardInterrupt: 

In [None]:
# Example IBC-specific questions
questions = [
    "What is the purpose of Appendix B: Board of Appeals?",
    "What are the occupancy classifications defined in Chapter 3?",
    "How does the IBC define mixed-use occupancies?",
    "What are the fire-resistance requirements for Type I construction?",
    "What are the minimum design loads for buildings and structures?"
]

# Loop through and retrieve answers
for question in questions:
    response = qa_chain.invoke({"query": question})["result"]
    print(f"Question: {question}\nAnswer: {response}\n")
