# Fire Clauses

This notebook reads the fire clauses from `fire-clauses.json`, creates sentence embeddings for them, and saves the vectors so they can be inserted into Qdrant.

## Creating the embeddings

In [2]:
from sentence_transformers import SentenceTransformer
import numpy as np
import pandas as pd
import os

In [3]:
DATA_DIR = os.path.join(os.getcwd(), '..', 'data')
DATA_DIR

'C:\\Uni\\SE700 - Research Project\\ConstructQA\\backend\\notebooks\\..\\data'

In [4]:
# Read the fire clauses into a dataframe
path = os.path.join(DATA_DIR, 'fire-clauses.json')
df = pd.read_json(path)

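# Because the clause is unique, we can use it as the index
df.set_index('clause', inplace=True)

# Handle limitOnApplication NaNs by replacing them with an empty string
# (assign the result back rather than calling fillna(inplace=True) on a column selection)
df['limitOnApplication'] = df['limitOnApplication'].fillna('')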

# First 10 clauses
df.head(10)

clause,content,limitOnApplication
C1—Objectives of clauses C2 to C6 (protection from fire),The objectives of clauses C2 to C6 are to: (a)...,
C2.1,Fixed appliances using controlled combustion a...,
C2.2,The maximum surface temperature of combustible...,
C2.3,Fixed appliances using controlled combustion a...,
C3.1,Buildings must be designed and constructed so ...,
C3.2,Buildings with a building height greater than ...,Clause C3.2 does not apply to importance level...
C3.3,Buildings must be designed and constructed so ...,
C3.5,Buildings must be designed and constructed so ...,
C3.6,Buildings must be designed and constructed so ...,
C3.7,External walls of buildings that are located c...,


In [5]:
# Load in the sentence transformer model - have a look at the comparisons here:
# https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models/
# multi-qa-MiniLM-L6-cos-v1 is trained for QA and is smaller with very minor loss in performance

model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')

In [6]:
# Encode the clause contents to create sentence vector embeddings (combine `content` and `limitOnApplication`)

sentences = (df['content'] + ' ' + df['limitOnApplication']).tolist()
vectors = model.encode(sentences, show_progress_bar=True)

# Expect an m x n matrix where m is the number of clauses and n is the embedding dimension of the model
vectors.shape

(12, 384)

In [12]:
# Save the vectors to a numpy file for the script to load and insert into Qdrant

save_path = os.path.join(DATA_DIR, 'fire-clauses.npy')
np.save(save_path, vectors, allow_pickle=False)
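
Since a separate script later loads this file, it can be worth confirming that the array survives a save/load round trip unchanged. The following is a minimal sketch, not part of the original notebook: it uses a random stand-in matrix with the same shape as the output above and a hypothetical file name (`demo-vectors.npy`) in a temporary directory, so it does not touch the real `fire-clauses.npy`.

```python
import os
import tempfile

import numpy as np

# Stand-in for the real embeddings (assumption): 12 clauses x 384 dimensions
rng = np.random.default_rng(0)
demo_vectors = rng.random((12, 384), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, used only for this check
    demo_path = os.path.join(tmp_dir, 'demo-vectors.npy')
    np.save(demo_path, demo_vectors, allow_pickle=False)

    # allow_pickle=False also on load, so only plain array data is accepted
    loaded = np.load(demo_path, allow_pickle=False)

# The loaded array should match the original in shape, dtype and values
same_shape = loaded.shape == demo_vectors.shape
same_dtype = loaded.dtype == demo_vectors.dtype
same_values = np.array_equal(loaded, demo_vectors)
print(same_shape, same_dtype, same_values)
```

The same three checks could be run against the real `vectors` array and `save_path` after the cell above.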

## Manual Test Query

To check that the vectors were created as expected, we manually search for a clause and find the closest match (we don't use Qdrant here yet).

In [7]:
from sentence_transformers import util

In [8]:
# Target clause C3.8
question = 'How high must the smoke be above the floor when firefighters put out a fire with water?'
context = df.loc['C3.8', 'content']

print(f'Question: {question}')
print()
print(f'Expected context: {context}')

Question: How high must the smoke be above the floor when firefighters put out a fire with water?

Expected context: Firecells located within 15 m of a relevant boundary that are not protected by an automatic fire sprinkler system, and that contain a fire load greater than 20 TJ or that have a floor area greater than 5,000 m2  must be designed and constructed so that at the time that firefighters first apply water to the fire, the maximum radiation flux at 1.5 m above the floor is no greater than 4.5 kW/m2 and the smoke layer is not less than 2 m above the floor.


In [13]:
# Encode the question
question_vector = model.encode(question)
question_vector.shape

(384,)

In [11]:
# Look for the top 3 closest matches - we use cosine similarity and sort the scores in ascending order.
# From the sorted scores we take the last 3 (the top 3) and reverse them into descending order.
# We then look up in the dataframe the clauses that match the top 3 scores

# cos_sim returns a torch tensor, so convert it to a numpy array before sorting
scores = util.cos_sim(np.array([question_vector]), vectors)[0].numpy()
top_score_ids = np.argsort(scores)[-3:][::-1]
df.iloc[top_score_ids][['content']]

clause,content
C3.8,Firecells located within 15 m of a relevant bo...
C3.5,Buildings must be designed and constructed so ...
C3.1,Buildings must be designed and constructed so ...
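
The ranking step above can also be sketched in plain NumPy to show what the cosine-similarity search is doing: normalise the query and every row to unit length, take dot products, and keep the indices of the highest scores. This is an illustrative sketch only. The helper `top_k_cosine` and the toy 2-dimensional vectors are hypothetical and stand in for the 384-dimensional embeddings used in this notebook.

```python
import numpy as np


def top_k_cosine(query, matrix, k=3):
    """Return (indices, scores) of the k rows of `matrix` most similar to `query`."""
    # Scale the query and each row to unit length so the dot product equals the cosine
    query_unit = query / np.linalg.norm(query)
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = matrix_unit @ query_unit

    # argsort is ascending, so reverse it and keep the first k indices
    top_ids = np.argsort(scores)[::-1][:k]
    return top_ids, scores[top_ids]


# Toy data (assumption): four 2-dimensional "embeddings" and one query
toy_vectors = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [-1.0, 0.0],
])
toy_query = np.array([2.0, 0.1])

ids, top_scores = top_k_cosine(toy_query, toy_vectors, k=3)
print(ids)
print(top_scores)
```

Row 0 points almost exactly in the query's direction, so it ranks first, followed by row 2 and then row 1. Row 3 points the opposite way and is left out of the top 3.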


As we can see, clause C3.8 was the top match. However, we will use Qdrant and its client library to perform semantic search and have a nicer developer experience.