# Ensemble Retriever

The ensemble retriever combines a dense retriever and a sparse retriever to return more relevant results.  
Dense retrieval embeds the documents, stores the vectors in a vector database such as FAISS, and searches that database by vector similarity.  
BM25 is a sparse retriever: it splits the queries and documents into words and scores documents on the words they share, without any embedding.  
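
To make the sparse side concrete, here is a minimal sketch of BM25 scoring in plain Python. It uses the textbook Okapi formula with whitespace tokenization and typical default parameters (`k1=1.5`, `b=0.75`); these are assumptions for illustration, and the `rank_bm25` package used later may differ in details such as its IDF variant. The toy documents are made up.

```python
import math
from collections import Counter

def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with an Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Length normalisation: long documents need more matches to score well
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy documents, not taken from the dataset
toy_docs = [
    "financial assistance and emergency relief",
    "tennis club with halls for hire",
    "financial counselling service",
]
scores = bm25_scores("financial assistance", toy_docs)
best = max(range(len(toy_docs)), key=scores.__getitem__)
print(best, scores)
```

The document that shares no word with the query scores zero, which is the main weakness of sparse retrieval: it cannot match synonyms or paraphrases.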

In [28]:
#extract the important data
import pandas as pd
data=pd.read_csv('decoded_sacom.csv')
data.drop(["S Street Addr 1","S Street Addr 2","S State","S Postcode",'Phone','Mobile', 'Email', 'Website', 'Open Hours', 'Wheelchair Access','Toilets Access', 'Disabled Parking'],axis=1, inplace=True)
service = data.drop(['Subjects'],axis=1)
subject = data.drop(['Services'],axis=1)
service['Services'] = service['Services'].str.split('\n')
service = service.explode('Services')
subject['Subjects'] = subject['Subjects'].str.split(';')
subject = subject.explode('Subjects')
combined_df = pd.concat([service, subject])
combined_df = combined_df.reset_index(drop=True) #reset the index, which is jumbled up after splitting and concatenating
#combined_df.to_csv('filtered_sacom.csv', index=False)

In [29]:
location=data["S Suburb"].str.lower() #extract the location and council for metadata
council=data["Council"].str.lower()
data2=data.drop(['S Suburb','Council'],axis=1)

In [30]:
#create the documents 
def create_row_strings(df):
    row_strings = []  # Initialize an empty list to store row strings
    # Iterate over each row in the DataFrame
    for index, row in df.iterrows():
        # Initialize an empty list to store column name-value pairs
        column_value_pairs = []

        # Iterate over each column in the row
        for column_name, value in row.items():
            # Check if the value is not null (not NaN)
            if pd.notna(value):
                # Format column name and value as "column_name: value"
                column_value_pair = f"{column_name}: {value}"
                column_value_pairs.append(column_value_pair)

        # Join column name-value pairs with newline separator
        formatted_row_string = " \n".join(column_value_pairs)

        # Append the formatted row string to the list
        row_strings.append(formatted_row_string)

    return row_strings

# Create a list of row strings
doc_list= create_row_strings(data)
doc_list2= create_row_strings(data2)
#splited_list = create_row_strings(combined_df)


In [45]:
len(doc_list)

14378

In [41]:
doc_list[0]

'Org ID: 193932 \nOrg Name: RSL Ardrossan Sub Branch \nAKA: Ardrossan RSL; Returned & Services League Ardrossan \nS Suburb: Ardrossan \nServices: Welfare and pensions support for ex-servicemen and their families\nSocial and recreational activities\nCommemoration activities - ANZAC Day, Remembrance Day and other significant events \nOrg Type: Business \nLocal Community dir: Service Clubs \nSubjects: Ex-Defence Service Groups; Halls For Hire; Social & Activity Groups; Support & Resource Groups; Veterans \nPrimary Category: Recreation \nCouncil: Yorke Peninsula Council'

In [46]:
import json

# Use json.dump to write the list to a file
with open('sacommunity.json', 'w') as f:
    json.dump(doc_list, f)

In [31]:
#create langchain document
from langchain_core.documents import Document

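def create_documents(lis, locations=None, councils=None):
    documents=[]
    for i, text in enumerate(lis):
        metadata = {}
        # attach location and council as metadata only when both are supplied
        if locations is not None and councils is not None:
            metadata = {"location": locations.iloc[i], "council": councils.iloc[i]}
        documents.append(Document(page_content=text, metadata=metadata))
    return documents

docs = create_documents(doc_list) #without metadata
docs2 = create_documents(doc_list2, location, council) #location and council into metadata
#splits = create_documents(splited_list)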

In [32]:
print(docs[0])

page_content='Org ID: 193932 \nOrg Name: RSL Ardrossan Sub Branch \nAKA: Ardrossan RSL; Returned & Services League Ardrossan \nS Suburb: Ardrossan \nServices: Welfare and pensions support for ex-servicemen and their families\nSocial and recreational activities\nCommemoration activities - ANZAC Day, Remembrance Day and other significant events \nOrg Type: Business \nLocal Community dir: Service Clubs \nSubjects: Ex-Defence Service Groups; Halls For Hire; Social & Activity Groups; Support & Resource Groups; Veterans \nPrimary Category: Recreation \nCouncil: Yorke Peninsula Council'


##### Initialize the sparse retriever (BM25), which retrieves directly from the documents

In [33]:
#!pip install rank_bm25
from langchain_community.retrievers import BM25Retriever
retriever = BM25Retriever.from_documents(docs, k=5) #return k results
retriever2 = BM25Retriever.from_documents(docs2, k=5)

In [34]:
retriever.invoke("financial assistance marion")

[Document(page_content='Org ID: 199358 \nOrg Name: Workskil Australia - Kangaroo Island \nS Suburb:  Kingscote \nServices: Access to Government assistance\nHelp with registering for ongoing assistance to help search for work, obtain financial assistance with training, clothes and other items. \nPeople who have registered to receive bush fire assistance payments can be assisted to access other support services.\nThose who have lost their job or have had reduced hours as a result of the bush fires can be connected with other support services.\nAccess to job vacancies on the Island.\nHelp can be offered to coordinate any ongoing mental health support, using financial assistance to cover costs. \nOrg Type: Community \nLocal Community dir: Employment Services \nSubjects: Employment Assistance Programs; Employment Counselling; Employment Services \nPrimary Category: Employment \nCouncil: Kangaroo Island Council'),
 Document(page_content='Org ID: 202974 \nOrg Name: Rural Business Support \nAK

In [35]:
retriever2.invoke("financial assistance marion")

[Document(page_content='Org ID: 199358 \nOrg Name: Workskil Australia - Kangaroo Island \nServices: Access to Government assistance\nHelp with registering for ongoing assistance to help search for work, obtain financial assistance with training, clothes and other items. \nPeople who have registered to receive bush fire assistance payments can be assisted to access other support services.\nThose who have lost their job or have had reduced hours as a result of the bush fires can be connected with other support services.\nAccess to job vacancies on the Island.\nHelp can be offered to coordinate any ongoing mental health support, using financial assistance to cover costs. \nOrg Type: Community \nLocal Community dir: Employment Services \nSubjects: Employment Assistance Programs; Employment Counselling; Employment Services \nPrimary Category: Employment', metadata={'location': ' kingscote', 'council': 'kangaroo island council'}),
 Document(page_content='Org ID: 202974 \nOrg Name: Rural Busi

##### Create a FAISS vector database by converting the documents into embeddings and adding them to FAISS

In [9]:
import os
from langchain_google_genai import GoogleGenerativeAIEmbeddings
os.environ["GOOGLE_API_KEY"] = "insert your api key here"
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

from langchain_community.vectorstores import FAISS
# faiss = FAISS.from_documents(docs, embeddings)
# faiss.save_local("faiss_index2") #save to local folder
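
As a rough illustration of what the vector database does at query time, the sketch below runs a brute-force nearest-neighbour search over toy vectors using Euclidean (L2) distance. The 3-dimensional vectors are hypothetical stand-ins for real embeddings, and the helper names are made up; FAISS performs the same kind of search far more efficiently, and the distance metric it uses depends on how the index was built.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query_vec, doc_vecs, k=2):
    """Return the indices of the k document vectors closest to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: l2_distance(query_vec, doc_vecs[i]))
    return ranked[:k]

# Hypothetical toy "embeddings"; real ones have hundreds of dimensions
toy_vectors = [
    [0.9, 0.1, 0.0],  # e.g. a hall-hire listing
    [0.1, 0.9, 0.0],  # e.g. a health service
    [0.8, 0.2, 0.1],  # e.g. another hall-hire listing
]
query_vector = [1.0, 0.0, 0.0]
top = nearest(query_vector, toy_vectors, k=2)
print(top)
```

Because similarity is computed on vectors rather than words, a dense retriever can match related meanings even when no query word appears in the document.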

##### Initialise dense retriever (FAISS)

In [12]:
faiss=FAISS.load_local("faiss_index2", embeddings, allow_dangerous_deserialization=True)
faiss_retriever = faiss.as_retriever(search_kwargs={"k": 10}) #return 10 results

In [18]:
faiss_retriever.invoke("hall for hire marion")

[Document(page_content='Org ID: 204141 \nOrg Name: Marion Tennis Club \nS Suburb: Marion \nOrg Type: Community \nLocal Community dir: Tennis \nPrimary Category: Recreation \nCouncil: City of Marion \nSubjects:  Halls For Hire'),
 Document(page_content='Org ID: 203702 \nOrg Name: Marion Bowling Club \nS Suburb: Marion \nOrg Type: Community \nLocal Community dir: Bowling \nPrimary Category: Recreation \nCouncil: City of Marion \nSubjects:  Halls For Hire'),
 Document(page_content='Org ID: 217527 \nOrg Name: District Council of Karoonda East Murray  \nS Suburb: Karoonda \nOrg Type: Government \nLocal Community dir: Local Government \nPrimary Category: Government \nCouncil: Karoonda East Murray \nSubjects:  Halls For Hire'),
 Document(page_content='Org ID: 203662 \nOrg Name: Active Elders Association Inc. \nS Suburb: Ascot Park \nOrg Type: Community \nLocal Community dir: Seniors \nPrimary Category: Recreation \nCouncil: City of Marion \nSubjects:  Halls For Hire'),
 Document(page_content=

In [14]:
from langchain.retrievers import EnsembleRetriever
ensemble_retriever = EnsembleRetriever(
    retrievers=[retriever, faiss_retriever], weights=[0.5, 0.5]
)
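
To my understanding, `EnsembleRetriever` merges the ranked lists with weighted Reciprocal Rank Fusion: each document gets `weight / (rank + c)` from every retriever that returned it, and the sums decide the final order. The sketch below is an illustration under that assumption, not the library's code; the constant `c=60` and the document IDs are assumed for the example.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several ranked lists of document IDs with weighted RRF."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # A document found by several retrievers accumulates score
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from the two retrievers
bm25_ranking = ["quitline", "hospital", "pharmacy"]
faiss_ranking = ["psychmed", "quitline", "clinic"]
fused = weighted_rrf([bm25_ranking, faiss_ranking], weights=[0.5, 0.5])
print(fused)
```

A document returned by both retrievers is counted once and tends to rise to the top, so the fused list can be shorter than the sum of the two `k` values.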

In [25]:
result=ensemble_retriever.invoke("quit smoking")

In [26]:
len(result) #up to 5 from BM25 (k=5) plus 10 from FAISS (k=10), with duplicates merged

15

In [27]:
result

[Document(page_content='Org ID: 202205 \nOrg Name: Quitline \nAKA: OxyGen \nS Suburb: Eastwood \nServices: Confidential free telephone information, counselling and support service for people who want to quit smoking cigarettes and other tobacco products \nOrg Type: Community \nLocal Community dir: Drug & Alcohol Services \nSubjects: Smoking ; Treatment \nPrimary Category: Health & Disability \nCouncil: City of Burnside', metadata={'location': 'eastwood', 'council': 'city of burnside'}),
 Document(page_content='Org ID: 228297 \nOrg Name: PsychMed - City \nS Suburb: Adelaide \nOrg Type: Business \nLocal Community dir: Mental Health Services \nPrimary Category: Health & Disability \nCouncil: City of Adelaide \nSubjects:  Smoking '),
 Document(page_content="Org ID: 201607 \nOrg Name: Lyell McEwin Hospital \nAcronym: LMH \nS Suburb: Elizabeth Vale \nServices: Emergency medical care/casualty\nMedical and surgical care\nPsychiatric care\nPaediatric care\nAllied health services - dietetics, oc

#### The results are what we expected and a big improvement over using FAISS alone