<a href="https://colab.research.google.com/github/etuckerman/SOCOTEC/blob/main/SOCOTEC_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import torch

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU found!")


GPU: NVIDIA A100-SXM4-40GB


In [2]:
import torch

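# Make float16 the default dtype for new floating-point tensors and allow TF32 matmuls on the A100.
# Note: this changes the global default dtype; it is not autocast-style mixed precision.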
torch.set_default_dtype(torch.float16)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True


In [3]:
%%capture
!pip install llama_parse huggingface_hub langchain chromadb nest_asyncio langchain-community unstructured langchain-huggingface gradio

In [4]:
!nvidia-smi


Tue Jan  7 17:56:43 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P0              43W / 400W |      5MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

# RAG PIPELINE

# Loading and Preprocessing

In [5]:
import nest_asyncio
from llama_parse import LlamaParse

# Apply nest_asyncio to handle the event loop
nest_asyncio.apply()

### BASIC PARSING
# # Initialize the LlamaParse parser with optimized parsing instructions
# parser = LlamaParse(
#     api_key=os.environ["LLAMA_CLOUD_API_KEY"],  # read the key from the environment; never commit it
#     result_type="markdown",  # Retain markdown format for structured output
#     language="en",  # Set to English since the IBC is in English
#     verbose=True,  # Enable detailed logs to monitor parsing performance
#     is_formatting_instruction=True,  # Preserve formatting for context retrieval
#     parsing_instruction="""
#         Extract the following key elements from the document:
#         1. Chapter titles and their numbers.
#         2. Section headings and subheadings with their corresponding numbers.
#         3. Key definitions and terms listed in the document.
#         4. Detailed descriptions of occupancy classifications, fire-resistance requirements, and structural design criteria.
#         5. All tables and their captions, including their associated data.
#         6. Any reference codes, figures, or diagrams mentioned in the text.
#         Format the extracted data in a structured and readable manner, preserving markdown styling for clarity (e.g., **bold** headings, bullet points for lists, etc.).
#     """
# )

### OPTIMISED PARSING TEST [currently costs $30, so I cancelled it]
# Initialize the LlamaParse parser with optimized parameters.
# The API key is read from the environment rather than hard-coded in the notebook.
import os

parser = LlamaParse(
    api_key=os.environ["LLAMA_CLOUD_API_KEY"],
    is_remote=False,  # Processing locally for faster iterations
    verbose=True,  # Keep verbose for detailed logs
    show_progress=True,  # Show progress for better tracking
    language="en",  # Document language is English
    split_by_page=True,  # Process document page by page for modularity
    result_type="markdown",  # Export as markdown for better structuring
    max_timeout=3000,  # Increase timeout for processing large documents
    num_workers=6,  # Utilize 6 workers for concurrent processing
    parsing_instruction=(
        "Extract all critical information, including definitions, tables, figures, and important text "
        "relevant to occupancy classifications, construction types, fire-resistance requirements, "
        "design loads, and any other regulations. Focus on sections that may aid in answering queries."
    ),
    structured_output=False,  # Output as plain markdown, structured parsing is unnecessary here
    annotate_links=True,  # Annotate links for better context during retrieval
    auto_mode=True,  # Enable auto mode to trigger optimizations for certain elements
    auto_mode_trigger_on_table_in_page=True,  # Prioritize tables (highly structured info)
    auto_mode_trigger_on_image_in_page=True,  # Include charts/diagrams for completeness
    disable_ocr=False,  # Allow OCR for text in non-standard formats
    extract_charts=True,  # Include chart data in the parsed output
    extract_layout=False,  # Skip layout info, focusing purely on content
    premium_mode=True,  # Enable premium processing for improved accuracy
    page_separator="\n\n---\n\n",  # Separate pages clearly for retrieval
    max_pages=None,  # Process the entire document
    continuous_mode=False,  # Avoid continuous mode; keep pages distinct
)


# Parse the IBC PDF
parsed_documents = parser.load_data("/content/IBC.pdf")

# Save the parsed pages to a single markdown file
with open("IBC.md", "w", encoding="utf-8") as f:
    for doc in parsed_documents:
        f.write(doc.text + "\n")


Started parsing the file under job_id 09ab8e9f-7e24-47a3-a891-9971481e4ae3
.....

KeyboardInterrupt: 

# Embedding and Vector Store setup

A document as long as the IBC has to be chunked and embedded with some care, because chunk size and the choice of embedding model together determine both retrieval accuracy and indexing cost.

Optimizing Text Chunking and Embedding:

Text Chunking:

- Chunk Size: Set chunk_size to 1500 characters. Chunks of this size are manageable for most language models while still carrying enough context to be understood on their own.
- Overlap: Keep an overlap of 100 characters (chunk_overlap=100), so that text near a chunk boundary appears in both neighbouring chunks and references across sections are less likely to be cut off.

Embeddings:

- Model Selection: all-MiniLM-L6-v2 is a small sentence-embedding model (384-dimensional vectors) that balances retrieval quality against computational cost.
- Vector Store: Use Chroma as the vector store; it supports efficient similarity search over the embedded chunks.
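To make the effect of chunk_size and chunk_overlap concrete, here is a minimal sketch of fixed-width character chunking. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, which this sketch does not do. The helper name `chunk_text` and the sample string are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1500, chunk_overlap: int = 100) -> list[str]:
    """Split text into character windows; neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


# Toy input: 4000 characters of repeating digits (illustrative only)
sample = "".join(str(i % 10) for i in range(4000))
chunks = chunk_text(sample)
print([len(c) for c in chunks])
# The last 100 characters of one chunk are the first 100 of the next
print(chunks[0][-100:] == chunks[1][:100])
```

With a step of 1400 characters, each boundary region is stored twice, which is the price paid for keeping boundary text intact in at least one chunk.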

In [5]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import UnstructuredMarkdownLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter


# Load the parsed markdown document
loader = UnstructuredMarkdownLoader("IBC.md")
docs = loader.load()

# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=100)
texts = text_splitter.split_documents(docs)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:

# Create embeddings and vector store
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(texts, embeddings)
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 2})
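Conceptually, a similarity retriever with k=2 scores every stored chunk against the query embedding and returns the two best matches. The sketch below illustrates that idea with cosine similarity on toy 2-dimensional vectors. It is not Chroma's implementation, and the metric Chroma actually uses depends on how the collection is configured; `cosine_similarity`, `top_k`, and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]


# Toy embeddings standing in for chunk vectors (illustrative only)
toy_docs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
toy_query = [1.0, 0.0]
print(top_k(toy_query, toy_docs, k=2))
```

A small k keeps the prompt short but means an answer that spans more than two chunks may be only partly retrieved.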


In [None]:
#---------testing pinecone--------------------#

In [15]:
!pip install pinecone-client


Collecting pinecone-client
  Downloading pinecone_client-5.0.1-py3-none-any.whl.metadata (19 kB)
Collecting pinecone-plugin-inference<2.0.0,>=1.0.3 (from pinecone-client)
  Downloading pinecone_plugin_inference-1.1.0-py3-none-any.whl.metadata (2.2 kB)
Collecting pinecone-plugin-interface<0.0.8,>=0.0.7 (from pinecone-client)
  Downloading pinecone_plugin_interface-0.0.7-py3-none-any.whl.metadata (1.2 kB)
Downloading pinecone_client-5.0.1-py3-none-any.whl (244 kB)
Downloading pinecone_plugin_inference-1.1.0-py3-none-any.whl (85 kB)
Downloading pinecone_plugin_interface-0.0.7-py3-none-any.whl (6.2 kB)
Installing collected packages: pinecone-plugin-interface, pinecone-plugin-inference, pinecone-client
Successfully installed pinecone-client-5.0.

In [16]:
import os
from pinecone import Pinecone

# Initialize Pinecone; the API key is read from the environment rather than hard-coded
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])


In [23]:
import os
from pinecone import Pinecone

# Initialize Pinecone with an API key read from the environment (never hard-code it)
api_key = os.environ["PINECONE_API_KEY"]
pc = Pinecone(api_key=api_key)

# Define a smaller batch size for splitting the texts
batch_size = 10  # Adjust as needed

# Split the texts into smaller batches
text_batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

# Initialize Pinecone index (existing "socotec" index)
index_name = "socotec"  # Your existing index name
index = pc.Index(index_name)  # Access the "socotec" index

# Generate embeddings for each batch and upsert them into Pinecone
for batch_num, batch in enumerate(text_batches):
    # Generate embeddings for the batch.
    # NOTE: this call raised the 404 recorded in this cell's output: "text-embedding-3-large"
    # is an OpenAI model and was not found by Pinecone's inference API.
    embeddings = pc.inference.embed(
        model="text-embedding-3-large",  # Must be a model Pinecone hosts and must match the index dimension
        inputs=[text.page_content for text in batch],  # Use .page_content for each document in the batch
        parameters={"input_type": "passage"}
    )

    # Prepare the vectors for upsertion
    vectors = []
    for idx, emb in enumerate(embeddings):
        vectors.append({
            "id": f"doc-{batch_num * batch_size + idx}",  # Offset by batch so IDs stay unique across batches
            "values": emb["values"],  # Embedding values from the response
            "metadata": {"text": batch[idx].page_content}  # Store the original text in metadata
        })

    # Upsert vectors into the Pinecone index
    index.upsert(vectors=vectors, namespace="ibc_namespace")  # You can change the namespace if needed

# Confirm that the vectors are inserted correctly
print("Embeddings have been upserted successfully!")


NotFoundException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'content-type': 'text/plain; charset=utf-8', 'x-pinecone-api-version': '2024-10', 'access-control-allow-origin': '*', 'vary': 'origin,access-control-request-method,access-control-request-headers', 'access-control-expose-headers': '*', 'X-Cloud-Trace-Context': '600c71e905ec1207fe05e3777f8cf415', 'Date': 'Tue, 07 Jan 2025 18:32:21 GMT', 'Server': 'Google Frontend', 'Content-Length': '96', 'Via': '1.1 google', 'Alt-Svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000'})
HTTP response body: {"error":{"code":"NOT_FOUND","message":"Model 'text-embedding-3-large' not found"},"status":404}


# MODEL SETUP

In [6]:
# Step 3: Load the Qwen Model
from transformers import pipeline
from langchain_huggingface import HuggingFacePipeline

qwen_pipe = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-7B",
    tokenizer="Qwen/Qwen2.5-7B",
    device=0  # Use GPU
)
qwen_llm = HuggingFacePipeline(pipeline=qwen_pipe)

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Device set to use cuda:0


## Refine Prompt Template

In [25]:
from langchain.prompts import PromptTemplate

prompt = PromptTemplate(
    input_variables=["query"],  # The template below has no {context} placeholder
    template=(
        # Each fragment ends with a space or newline so the concatenated sentences do not run together
        "You are Qwen, created by Alibaba Cloud. You are a helpful assistant. "
        "You have extensive knowledge of the IBC 2018 International Building Code. "
        "Answer the following query based on your knowledge of the IBC, as if you are already familiar with the content. "
        "Do not mention or reference any specific document or context. Just provide a direct and concise answer.\n"
        "Query: {query}\n"
        "Response:"
    ),
)
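The template in this cell has no `{context}` placeholder, so retrieved chunks cannot reach the model through it. If the intent is to ground answers in retrieved text, a template along the following lines might be used. This is only a sketch with plain `str.format`; `RAG_TEMPLATE` and `build_prompt` are hypothetical names, and the wording of the instructions is an assumption rather than the notebook's prompt.

```python
# Hypothetical template that exposes both the retrieved context and the query
RAG_TEMPLATE = (
    "You have extensive knowledge of the IBC 2018 International Building Code.\n"
    "Answer the query using the context below. Give a direct and concise answer.\n\n"
    "Context:\n{context}\n\n"
    "Query: {query}\n"
    "Response:"
)


def build_prompt(chunks, query):
    """Join retrieved chunks into one context block and fill the template."""
    context = "\n\n".join(chunks)
    return RAG_TEMPLATE.format(context=context, query=query)


# Toy call with a single chunk (illustrative only)
example = build_prompt(
    ["Chapter 2 contains definitions of terms used throughout the code."],
    "Where are terms defined?",
)
print(example)
```

Ending the prompt with `Response:` gives a base (non-chat) model a clear place to start its completion.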


## Set Up RetrievalQA Chain

In [None]:

# # Step 6: Test the RAG System
# query_1 = "What is the purpose of Appendix B: Board of Appeals?"
# response_1 = qa_chain.invoke({"query": query_1})
# print(f"Answer 1: {response_1}")


In [11]:

# query_2 = "Explain the key concepts discussed in the document?"
# response_2 = qa_chain.invoke({"query": query_2})
# print(f"Answer 2: {response_2}")


Answer 2: {'query': 'Explain the key concepts discussed in the document?', 'result': "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\nContext:\nChapter 29 of IBC correlates with Chapters 3 & 4 of IPC for plumbing fixtures and facilities\n\nThe image also provides brief descriptions of Chapters 1 and 2 of the IBC:\n\nChapter 1 establishes the scope, applicability, and administration of the code.\n\nChapter 2 contains definitions of terms used throughout the code.\n\nThe document emphasizes the importance of every word, term, and punctuation mark in the code, as they can impact the meaning and intended results of the code provisions.\n\nmeaning in the code and the code meaning can differ substantially from the ordinarily understood meaning of the term as used outside of the code. Where understanding of a term's definition is especially key to or necessary for understandin
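The `result` field in the output above begins with the prompt itself, which suggests the pipeline returns the prompt followed by the generated text. One way to show only the answer is to keep the text after the final answer marker. The sketch below assumes the prompt ends with `Helpful Answer:`; the truncated output does not show the end of the prompt, so treat that marker as an assumption and adjust it to whatever the chain's prompt actually ends with. `extract_answer` and the `raw` string are hypothetical.

```python
def extract_answer(result: str, marker: str = "Helpful Answer:") -> str:
    """Return the text after the last marker, or the whole string if the marker is absent."""
    _, found, tail = result.rpartition(marker)
    return tail.strip() if found else result.strip()


# Toy string shaped like an echoed prompt followed by a completion (illustrative only)
raw = (
    "Use the following pieces of context to answer the question at the end.\n\n"
    "Question: What does Chapter 2 contain?\n"
    "Helpful Answer: Definitions of terms."
)
print(extract_answer(raw))
```

Using the last occurrence of the marker avoids cutting at an earlier copy if the retrieved context happens to contain the same phrase.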

In [None]:
# # Example IBC-specific questions
# queries = [
#     "What is the purpose of Appendix B: Board of Appeals?",
#     "What are the occupancy classifications defined in Chapter 3?",
#     "How does the IBC define mixed-use occupancies?",
#     "What are the fire-resistance requirements for Type I construction?",
#     "What are the minimum design loads for buildings and structures?"
# ]


In [None]:
# # Loop through and retrieve answers
# for query in queries:
#     response = qa_chain.invoke({"query": query})
#     print(f"Query: {query}\nAnswer: {response}\n")


In [10]:
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_llm(llm=qwen_llm, retriever=retriever)


## Gradio Implementation




In [None]:
import gradio as gr
import torch
from langchain.chains import RetrievalQA

# Define the function that will display the results in a more readable format
def query_rag_system(query):
    # Use the qa_chain.invoke to get the response for the query
    response = qa_chain.invoke({"query": query})
    # Return the response in a user-friendly format (you can format it as needed)
    return response.get('result', "No result found")

# Create a Gradio interface
interface = gr.Interface(
    fn=query_rag_system,  # This is the function that will be called to generate the output
    inputs=gr.Textbox(label="Enter your query"),  # The input for the user query
    outputs=gr.Textbox(label="RAG System Answer", lines=20),  # The output for displaying the result
    live=False,  # Run only on submit; live=True would call the 7B model on every input change
    title="RAG Query Interface",  # Title for the interface
    description="Enter a query related to the IBC 2018 International Building Code, and the system will provide an answer based on the context."
)

# Launch the interface
interface.launch(debug=True, share=True)


Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://aa447d822108355542.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
