# RAG Similarity Search with LlamaIndex

(Auto Scrape Iteration 2 Part 3)

Once the scrape has been completed and cleaned, one option is to send it through a LlamaIndex pipeline and store the result in a Qdrant vector store for retrieval. Start by importing the necessary libraries.

In [1]:
import qdrant_client
import yaml
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore


Extract each document's metadata from the YAML front matter at the top of the file

In [2]:
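def get_document_metadata(filepath):
    """Return the YAML front matter of a file as a metadata dict."""
    with open(filepath, 'r', encoding='utf-8') as file:
        content = file.read()
    # Front matter sits between the first two '---' delimiters
    parts = content.split('---', 2)
    if len(parts) < 3:
        return {}  # no front matter: fall back to empty metadata
    # safe_load returns None for an empty block, so default to {}
    return yaml.safe_load(parts[1]) or {}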
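
Splitting on `---` anywhere in the text can misfire when a file lacks front matter but uses `---` elsewhere (for example as a horizontal rule). As a minimal sketch of a stricter alternative, the block below only treats `---` as a delimiter when it stands alone on a line, and only when the file opens with it. `split_front_matter` and the sample text are hypothetical and not part of the original pipeline; the returned front-matter string would still be passed to `yaml.safe_load`.

```python
def split_front_matter(text):
    """Split a document into (front_matter, body).

    Hypothetical helper: assumes the front matter, if present, starts on
    the very first line with '---' and ends at the next line that is
    exactly '---'.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != '---':
        return '', text  # no front matter at all
    for i in range(1, len(lines)):
        if lines[i].strip() == '---':
            front_matter = '\n'.join(lines[1:i])
            body = '\n'.join(lines[i + 1:])
            return front_matter, body
    return '', text  # opening delimiter without a closing one


# Made-up sample in the same shape as the scraped files
sample = (
    "---\n"
    "title: Bonus GFA Incentive Schemes\n"
    "date: 21 November 2022\n"
    "---\n"
    "# Heading\n"
    "\n"
    "Text with --- inside.\n"
)
front_matter, body = split_front_matter(sample)
print(front_matter)
print(body)
```

The inline `---` in the body is left untouched because it does not sit on a line of its own.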

Create the Qdrant client and set up vector stores for text and images. The two are kept separate because no multimodal embedding model (e.g. CLIP) can handle the chunking needs of text documents: CLIP's text encoder accepts only short inputs (77 tokens), while each chunk needed here is significantly longer than that.

In [3]:
# Create a Qdrant client backed by local on-disk storage
client = qdrant_client.QdrantClient(path="qdrant_db")

In [4]:
text_store = QdrantVectorStore(
    client=client, collection_name="text_collection"
)
image_store = QdrantVectorStore(
    client=client, collection_name="image_collection"
)
storage_context = StorageContext.from_defaults(
    vector_store=text_store, image_store=image_store
)

Load all files into the LlamaIndex pipeline

In [11]:
documents = SimpleDirectoryReader(
    input_dir="../data/Development-Control-jina",
    recursive=True,
    file_metadata=get_document_metadata,
).load_data()

Create the multimodal index

In [13]:
index = MultiModalVectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
)

And run the query! The retriever parameters can be changed, but in this case we want only the top 3 chunks and their metadata

In [15]:
from llama_index.core.response.notebook_utils import display_source_node

test_query = "What is the criteria for higher GFA for the SDI programme?"
# Retrieve the 3 chunks most similar to the query
retriever = index.as_retriever(similarity_top_k=3)
retrieval_results = retriever.retrieve(test_query)
# Show each chunk's text (up to 1000 characters) together with its metadata
for res_node in retrieval_results:
    display_source_node(res_node, source_length=1000, show_source_metadata=True)

**Node ID:** ebe68476-5fab-4ca8-a164-09b60c125811<br>**Similarity:** 0.8581828758601634<br>**Text:** Evaluation Criteria

Redevelopment proposals submitted under the SDI scheme shall be evaluated based on the following criteria:

**SDI Scheme Evaluation Criteria**

Urban Design and Architectural Design ConceptThe proposed project shall be a quality development that defines the site as a distinctive destination through its architectural design, scale, presence and setting in relation to the surrounding developments, pedestrian network, and the public realm.Environmental Improvement/ Contribution to the CommunityThe proposed project should enhance the public environment in a significant way and benefit the community at large, such as through:Quality public spaces;Measures designed to encourage the use of public transport and to discourage private car use;Enhanced pedestrian networks and promotion of active mobility;Public or cultural facilities (eg event and performance art venues, childcare facilities, and community services etc.);Enhancement to public infrastructure;Conservati...<br>**Metadata:** {'title': 'Bonus GFA Incentive Schemes', 'link': 'https://www.ura.gov.sg/Corporate/Guidelines/Development-Control/Non-Residential/Commercial/GFA-Incentive-Schemes', 'date': '21 November 2022'}<br>

**Node ID:** c969efa5-a954-4960-9f23-e6704229a048<br>**Similarity:** 0.8568873009922435<br>**Text:** Conditions

The evaluation by URA under the SDI Scheme is proposal-specific. An application that has been previously approved by the URA shall not be used or taken as a precedent for any other proposals or development applications seeking similar deviations from the planning parameters.

Any additional incentive GFA or development intensity granted under the SDI Scheme may be subject to SLA levying Land Betterment Charge, where applicable.

Any increase in development intensity approved by URA under this scheme shall not count towards the future development potential of the subject site.

Bonus GFA shall not apply for requirements mandated as part of the SDI Scheme. For example, if a minimum Green Mark score is required, the prevailing Green Mark Bonus GFA shall not apply. The developments will still be eligible for Bonus GFA granted under other applicable schemes such as balcony or indoor recreational spaces, subject to the prevailing overall cap on Bonus GFA.

Lease renew...<br>**Metadata:** {'title': 'Bonus GFA Incentive Schemes', 'link': 'https://www.ura.gov.sg/Corporate/Guidelines/Development-Control/Non-Residential/Commercial/GFA-Incentive-Schemes', 'date': '21 November 2022'}<br>

**Node ID:** 64153537-c626-48af-b3fa-422c2af1426f<br>**Similarity:** 0.8487512873565062<br>**Text:** Eligibility

The SDI Scheme is open to applications from building owners for developments in strategic areas across Singapore that meet the eligibility criteria.

In particular, applications to redevelop existing developments in Orchard Road, Central Business District (CBD) and Marina Centre areas are encouraged, in line with the broader planning intention to rejuvenate these areas. Sites that fall within the designated areas for the CBD Incentive Scheme shall be guided by that scheme instead of being considered under the SDI scheme.

Eligibility for consideration under the SDI Scheme is subject to the criteria outlined in the table below:

**SDI Scheme Eligibility Requirements**

Age of DevelopmentAt least 20 years old from date of last TOPExisting Land UseCommercial or mixed-use developments with predominantly commercial uses.Developments with predominantly residential uses will not be eligible.Transformational ImpactThe redevelopment proposal shall include a minimum of t...<br>**Metadata:** {'title': 'Bonus GFA Incentive Schemes', 'link': 'https://www.ura.gov.sg/Corporate/Guidelines/Development-Control/Non-Residential/Commercial/GFA-Incentive-Schemes', 'date': '21 November 2022'}<br>
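
For intuition about what `similarity_top_k=3` and the similarity scores above represent, here is a minimal pure-Python sketch of top-k retrieval by cosine similarity. It assumes the collection uses cosine distance, which depends on how the Qdrant collection is configured. The three-dimensional vectors and chunk names are made up for illustration; real embeddings have hundreds or thousands of dimensions, and Qdrant performs this search far more efficiently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=3):
    """Return the k (chunk_id, score) pairs with the highest similarity."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (hypothetical values, not taken from the real index)
chunks = {
    "evaluation_criteria": [0.9, 0.1, 0.0],
    "conditions": [0.8, 0.3, 0.1],
    "eligibility": [0.7, 0.2, 0.4],
    "unrelated": [0.0, 0.1, 0.9],
}
query = [1.0, 0.1, 0.0]

results = top_k(query, chunks, k=3)
for chunk_id, score in results:
    print(f"{chunk_id}: {score:.4f}")
```

The chunk whose vector points in nearly the same direction as the query ranks first, and the unrelated chunk is cut off by `k=3`.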

Close the client after use. To repopulate or update the stored items, uncomment the lines below and delete the collections before closing the client

In [None]:
#client.delete_collection(collection_name="text_collection")
#client.delete_collection(collection_name="image_collection")

In [16]:
client.close()

Helper function to manually inspect an unusual character at a given byte index in a file

In [None]:
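def char_at_byte_index(file_path, byte_index, encoding='utf-8'):
    with open(file_path, 'rb') as file:
        # Seek to the byte index
        file.seek(byte_index)

        # Read up to 4 bytes, the longest UTF-8 encoding of one character
        chunk = file.read(4)

    # Try the shortest prefix first so multi-byte characters decode whole
    for length in range(1, len(chunk) + 1):
        try:
            return chunk[:length].decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"No valid {encoding} character starts at byte {byte_index}: {chunk!r}")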

# Example usage
file_path = '..\\data\\DC-cleaned-md\\Non-Residential\\Hotel\\Waterbodies.md'
byte_index = 2933  # Replace with the desired byte index
character = char_at_byte_index(file_path, byte_index)
print(f"The character at byte index {byte_index} is: '{character}'")
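
When the byte index is not known in advance, a complementary approach is to scan the whole file and list every non-ASCII character with its byte offset. The sketch below is a hypothetical helper, not part of the original notebook. It assumes the data is valid UTF-8 (offsets would be wrong for encodings that emit a byte-order mark per `encode` call, such as UTF-16), and the sample bytes are made up.

```python
def find_non_ascii(data, encoding='utf-8'):
    """Return (byte_offset, character) pairs for every non-ASCII character.

    Hypothetical helper: `data` is the raw bytes of a file, assumed to be
    valid in the given encoding.
    """
    text = data.decode(encoding)
    hits = []
    offset = 0
    for char in text:
        # Width of this character in bytes, so offsets match file positions
        width = len(char.encode(encoding))
        if ord(char) > 127:
            hits.append((offset, char))
        offset += width
    return hits


# Made-up sample containing an accented letter, an en dash and a
# non-breaking space
sample = "Caf\u00e9 \u2013 5\u00a0m".encode('utf-8')
hits = find_non_ascii(sample)
for offset, char in hits:
    print(offset, repr(char), hex(ord(char)))
```

For a real file, the bytes can be obtained with `open(file_path, 'rb').read()`, and any reported offset can then be passed to `char_at_byte_index` for a closer look.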