# Vector search techniques

### Setup

In [1]:
from manifesto_qa.app import vector_db

from langchain_openai import OpenAIEmbeddings


In [2]:
weaviate_instance = vector_db.instance
weaviate_client = vector_db.client

### Typical vector search

Use the `similarity_search` method to run a vector search that retrieves the `k` most relevant documents from the vector database.
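
As a rough illustration of what a top-`k` vector search computes, the sketch below ranks toy vectors by cosine similarity to a query vector. It assumes cosine similarity as the metric and uses made-up 2-D vectors; the helper names `cosine_similarity` and `top_k` are hypothetical and this is not how the vector database implements its index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy 2-D "embeddings" (hypothetical values, not real model output).
toy_docs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
toy_query = [1.0, 0.0]

nearest = top_k(toy_query, toy_docs, k=2)
print(nearest)
```

A real vector store replaces the exhaustive loop with an approximate nearest-neighbour index, but the ranking idea is the same.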

In [3]:
search_query = "What is reform's position on the european union?"

In [4]:
relevant_docs = weaviate_instance.similarity_search(query=search_query, k=3)

In [5]:
for idx, doc in enumerate(relevant_docs):
    print(f"\nDocument {idx+1}:")
    print(doc.page_content[:150]+"...")


Document 1:
DRAFT
DRAFT
Brexit meant taking back control of our borders, laws and money. It should have allowed us to take our 
rightful place among the 168 other...

Document 2:
Leave the European Convention on Human Rights.
British laws and judges must never be overruled by a foreign court. We must be free to 
deport those we...

Document 3:
DRAFT
DRAFT
The UK’s constitutional arrangements need Reform.
 
We are ruled by an arrogant and out of touch elite. The two party system is broken. Th...


We can also retrieve the Document metadata from the returned results:

In [6]:
for d in relevant_docs:
    print(d.metadata)

{'page': 17, 'source': '/Users/longbe01/Documents/projects/llm-rag/data/Reform_UK_Contract_With_The_People.pdf'}
{'page': 17, 'source': '/Users/longbe01/Documents/projects/llm-rag/data/Reform_UK_Contract_With_The_People.pdf'}
{'page': 28, 'source': '/Users/longbe01/Documents/projects/llm-rag/data/Reform_UK_Contract_With_The_People.pdf'}


### Maximum marginal relevance (MMR)

Use maximum marginal relevance (MMR) to balance relevance and diversity in the search results.

Vary the value of `lambda_mult` between 0 and 1, where 0 gives maximum diversity and 1 gives minimum diversity (results ranked on relevance alone).
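
To make the role of `lambda_mult` concrete, here is a minimal sketch of greedy MMR selection over toy vectors. It assumes cosine similarity for both relevance and redundancy; the function names and the 2-D vectors are hypothetical, and the library's own implementation may differ in detail.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Greedy MMR: return the indices of k documents in selection order."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Toy vectors (hypothetical): documents 0 and 1 are near-duplicates,
# document 2 is less relevant but points in a different direction.
toy_docs = [[1.0, 0.1], [1.0, 0.12], [1.0, -0.5]]
toy_query = [1.0, 0.0]

relevance_only = mmr(toy_query, toy_docs, k=2, lambda_mult=1.0)
balanced = mmr(toy_query, toy_docs, k=2, lambda_mult=0.5)
print(relevance_only, balanced)
```

With `lambda_mult=1.0` the near-duplicate is picked second; with `lambda_mult=0.5` the redundancy penalty pushes the selection towards the more distinct document.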

In [7]:
weaviate_instance._embedding = OpenAIEmbeddings(model="text-embedding-ada-002")

In [8]:
docs_mmr = weaviate_instance.max_marginal_relevance_search(query=search_query, k=3, lambda_mult=0.5)

In [9]:
for idx, doc in enumerate(docs_mmr):
    print(f"\nDocument {idx+1}:")
    print(doc.page_content[:150]+"...")


Document 1:
DRAFT
DRAFT
Brexit meant taking back control of our borders, laws and money. It should have allowed us to take our 
rightful place among the 168 other...

Document 2:
partyof.wales5151 partyof.walesWe believe that Wales would 
be best served by re-joining 
the European Union at an 
appropriate point in time, 
recogn...

Document 3:
The out-of-touch wasteful BBC is institutionally biased. The TV licence is taxation without 
representation. In a world of on-demand TV  People should...


### Keyword search

Perform a keyword search rather than a vector similarity search. This uses the same `similarity_search` method, with the `alpha` parameter passed in.

Vary `alpha` between 0 (pure keyword search) and 1 (pure vector search). The default is `0.75`, so a default similarity search is in fact a hybrid search weighted towards vector similarity.
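
The effect of `alpha` can be sketched as a weighted blend of normalised keyword and vector scores. The code below is a simplified illustration under that assumption, not Weaviate's actual fusion algorithm; the helper names and the score values are hypothetical.

```python
def min_max_normalise(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


def hybrid_rank(keyword_scores, vector_scores, alpha=0.75):
    """Blend both score sets: alpha=0 is keyword only, alpha=1 is vector only."""
    kw = min_max_normalise(keyword_scores)
    vec = min_max_normalise(vector_scores)
    doc_ids = set(kw) | set(vec)
    fused = {
        d: alpha * vec.get(d, 0.0) + (1 - alpha) * kw.get(d, 0.0) for d in doc_ids
    }
    # Highest fused score first.
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical scores for three documents.
keyword_scores = {"a": 3.2, "b": 1.1, "c": 0.4}    # e.g. BM25-style scores
vector_scores = {"a": 0.70, "b": 0.72, "c": 0.90}  # e.g. cosine similarities

keyword_only = hybrid_rank(keyword_scores, vector_scores, alpha=0)
vector_only = hybrid_rank(keyword_scores, vector_scores, alpha=1)
default_blend = hybrid_rank(keyword_scores, vector_scores)
print(keyword_only, vector_only, default_blend)
```

When the keyword and vector rankings disagree, as in this toy data, changing `alpha` reorders the results.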

In [10]:
docs_kw = weaviate_instance.similarity_search(query=search_query, k=3, alpha=0)

In [11]:
for idx, doc in enumerate(docs_kw):
    print(f"\nDocument {idx+1}:")
    print(doc.page_content[:150]+"...")


Document 1:
DRAFT
DRAFT
Brexit meant taking back control of our borders, laws and money. It should have allowed us to take our 
rightful place among the 168 other...

Document 2:
Leave the European Convention on Human Rights.
British laws and judges must never be overruled by a foreign court. We must be free to 
deport those we...

Document 3:
DRAFT
DRAFT
The UK’s constitutional arrangements need Reform.
 
We are ruled by an arrogant and out of touch elite. The two party system is broken. Th...


Now try a pure vector search by setting `alpha=1`. In this case the results are identical to those from the keyword search and the default search above.

In [12]:
docs_vector = weaviate_instance.similarity_search(query=search_query, k=3, alpha=1)

In [13]:
for idx, doc in enumerate(docs_vector):
    print(f"\nDocument {idx+1}:")
    print(doc.page_content[:150]+"...")


Document 1:
DRAFT
DRAFT
Brexit meant taking back control of our borders, laws and money. It should have allowed us to take our 
rightful place among the 168 other...

Document 2:
Leave the European Convention on Human Rights.
British laws and judges must never be overruled by a foreign court. We must be free to 
deport those we...

Document 3:
DRAFT
DRAFT
The UK’s constitutional arrangements need Reform.
 
We are ruled by an arrogant and out of touch elite. The two party system is broken. Th...


### Vector search with filter

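Apply a filter to the document metadata to restrict the search to documents from a particular source. In this run the filter does not take effect: the metadata printed below shows results from other manifestos rather than the Conservative one.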

In [44]:
from weaviate.classes.query import Filter

docs_filtered = weaviate_instance.similarity_search(
    query="What is the housing strategy?", 
    k=3, 
    filters=Filter.by_property("source").equal("/Users/longbe01/Documents/projects/llm-rag/data/Conservative-Manifesto-GE2024.pdf")
    # filter={"source":"/Users/longbe01/Documents/projects/llm-rag/data/Conservative-Manifesto-GE2024.pdf"}
)

In [45]:
for idx, doc in enumerate(docs_filtered):
    print(f"\nDocument {idx+1}:")
    print(doc.page_content[:150]+"...")


Document 1:
housing sector:
Social housing: We want to see an end to 
competitive bidding for the social housing decarbonisation fund, so funding is there for  
a...

Document 2:
housing needs and 
increasing the housing stock 
will also reduce the numbers 
of individuals and families 
facing homelessness in 
Wales. In the Sene...

Document 3:
fundamentally change their approach to housing 
asylum seekers, ensuring accommodation is 
safe, suitable and dignified....


In [46]:
for doc in docs_filtered:
    print(doc.metadata)

{'page': 9, 'source': '/Users/longbe01/Documents/projects/llm-rag/data/Green-Party-2024-General-Election-Manifesto-Long-version_imprint.pdf'}
{'page': 31, 'source': '/Users/longbe01/Documents/projects/llm-rag/data/Plaid_Cymru_Maniffesto_2024_ENGLISH.pdf'}
{'page': 27, 'source': '/Users/longbe01/Documents/projects/llm-rag/data/2024-06-20b-SNP-General-Election-Manifesto-2024_interactive.pdf'}
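
One way to confirm whether a metadata filter took effect is to compare each returned document's `source` with the expected value. The sketch below does this for the metadata printed above; the helper name `sources_outside_filter` is hypothetical, and it assumes every metadata dictionary has a `source` key.

```python
def sources_outside_filter(metadatas, expected_source):
    """Return the distinct 'source' values that do not match expected_source."""
    return sorted({m["source"] for m in metadatas if m["source"] != expected_source})


expected = "/Users/longbe01/Documents/projects/llm-rag/data/Conservative-Manifesto-GE2024.pdf"

# Metadata copied from the output above.
returned = [
    {"page": 9, "source": "/Users/longbe01/Documents/projects/llm-rag/data/Green-Party-2024-General-Election-Manifesto-Long-version_imprint.pdf"},
    {"page": 31, "source": "/Users/longbe01/Documents/projects/llm-rag/data/Plaid_Cymru_Maniffesto_2024_ENGLISH.pdf"},
    {"page": 27, "source": "/Users/longbe01/Documents/projects/llm-rag/data/2024-06-20b-SNP-General-Election-Manifesto-2024_interactive.pdf"},
]

unexpected = sources_outside_filter(returned, expected)
for source in unexpected:
    print("Not filtered out:", source)
```

In the notebook the same check could be run as `sources_outside_filter([d.metadata for d in docs_filtered], expected)`; an empty list would indicate that the filter was applied.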


### Quantify search similarity

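Return a relevance score alongside each retrieved document. `similarity_search_with_score` returns a list of `(document, score)` tuples, here sorted from highest to lowest score.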

In [26]:
scored_docs = weaviate_instance.similarity_search_with_score(search_query, k=5)

for doc in scored_docs:
    print(f"{doc[1]:.3f}", ":", doc[0].page_content[:100] + "...")

0.832 : DRAFT
DRAFT
Brexit meant taking back control of our borders, laws and money. It should have allowed ...
0.831 : Leave the European Convention on Human Rights.
British laws and judges must never be overruled by a ...
0.830 : DRAFT
DRAFT
The UK’s constitutional arrangements need Reform.
 
We are ruled by an arrogant and out ...
0.830 : the Irish Sea. Northern Ireland is still in the EU’s single market for goods. British citizens in 
N...
0.829 : Richard Tice
Leader, Reform UKBritain needs Reform and Reform needs you.
Britain has so much potenti...
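
The scores above differ only in the third decimal place, so it can help to look at how far each result sits from the best one. The sketch below keeps the `(document, score)` pairs within a chosen margin of the top score. It assumes that a higher score means a closer match; the helper name `within_margin`, the placeholder document labels, and the margin value are hypothetical.

```python
def within_margin(scored_docs, margin):
    """Keep (doc, score) pairs whose score is within `margin` of the best score."""
    if not scored_docs:
        return []
    best = max(score for _, score in scored_docs)
    return [(doc, score) for doc, score in scored_docs if best - score <= margin]


# Scores copied from the output above; the labels stand in for the documents.
scored = [
    ("doc A", 0.832),
    ("doc B", 0.831),
    ("doc C", 0.830),
    ("doc D", 0.830),
    ("doc E", 0.829),
]

close = within_margin(scored, margin=0.0015)
print([doc for doc, _ in close])
```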


### Infer the metadata from the query itself

Suppose we want to restrict the results to the relevant source manifesto, but have the filter generated automatically from the query rather than passed in by hand. For example, a query asking about the Liberal Democrats should only retrieve results from their manifesto.
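
Conceptually, this means turning a free-text question into a search query plus a metadata filter. The toy sketch below does this with keyword matching, as a stand-in for the LLM-based query construction. The `PARTY_SOURCES` mapping, the `StructuredQuery` class and `build_structured_query` are hypothetical names, and the file names are taken from the metadata shown elsewhere in this notebook.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from party keywords to manifesto file names.
PARTY_SOURCES = {
    "lib dem": "For_a_Fair_Deal_-_Liberal_Democrat_Manifesto_2024.pdf",
    "liberal democrat": "For_a_Fair_Deal_-_Liberal_Democrat_Manifesto_2024.pdf",
    "green party": "Green-Party-2024-General-Election-Manifesto-Long-version_imprint.pdf",
    "conservative": "Conservative-Manifesto-GE2024.pdf",
}


@dataclass
class StructuredQuery:
    query: str                    # text used for the similarity search
    source_filter: Optional[str]  # manifesto file to restrict to, if any


def build_structured_query(question: str) -> StructuredQuery:
    """Infer a source filter from party names mentioned in the question."""
    lowered = question.lower()
    for keyword, source in PARTY_SOURCES.items():
        if keyword in lowered:
            return StructuredQuery(query=question, source_filter=source)
    # No party recognised: search across all manifestos.
    return StructuredQuery(query=question, source_filter=None)


lib_dem_query = build_structured_query("What is the lib dem policy on the european union?")
generic_query = build_structured_query("What is the housing strategy?")
print(lib_dem_query.source_filter, generic_query.source_filter)
```

Keyword matching like this is brittle; the point of using an LLM for this step is to handle paraphrases and filters on other fields such as `page`.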

In [27]:
from langchain_openai import OpenAI  # replaces the deprecated langchain.llms import
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

In [28]:
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The manifesto PDF the chunk is from",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the manifesto",
        type="integer",
    ),
]

In [29]:
document_content_description = "Manifestos"
llm = OpenAI(model='gpt-3.5-turbo-instruct', temperature=0)

retriever = SelfQueryRetriever.from_llm(
    llm,
    weaviate_instance,
    document_content_description,
    metadata_field_info,
    verbose=True
)


In [36]:
docs = retriever.get_relevant_documents(query="What is the lib dem policy on the european union?")


In [37]:
for d in docs:
    print(d.metadata)

{'page': 104, 'source': '/Users/longbe01/Documents/projects/llm-rag/data/For_a_Fair_Deal_-_Liberal_Democrat_Manifesto_2024.pdf'}
{'page': 108, 'source': '/Users/longbe01/Documents/projects/llm-rag/data/For_a_Fair_Deal_-_Liberal_Democrat_Manifesto_2024.pdf'}
{'page': 87, 'source': '/Users/longbe01/Documents/projects/llm-rag/data/For_a_Fair_Deal_-_Liberal_Democrat_Manifesto_2024.pdf'}
{'page': 19, 'source': '/Users/longbe01/Documents/projects/llm-rag/data/For_a_Fair_Deal_-_Liberal_Democrat_Manifesto_2024.pdf'}


In [38]:
docs_tory_eu = retriever.get_relevant_documents(query="What are the conservative party plans on the european union?")


In [39]:
for d in docs_tory_eu:
    print(d.metadata)

In [42]:
docs_green_eu = retriever.get_relevant_documents(query="What are the green party plans on the european union?")


In [43]:
for d in docs_green_eu:
    print(d.metadata)

{'page': 44, 'source': '/Users/longbe01/Documents/projects/llm-rag/data/Green-Party-2024-General-Election-Manifesto-Long-version_imprint.pdf'}
{'page': 43, 'source': '/Users/longbe01/Documents/projects/llm-rag/data/Green-Party-2024-General-Election-Manifesto-Long-version_imprint.pdf'}
{'page': 35, 'source': '/Users/longbe01/Documents/projects/llm-rag/data/Green-Party-2024-General-Election-Manifesto-Long-version_imprint.pdf'}
{'page': 44, 'source': '/Users/longbe01/Documents/projects/llm-rag/data/Green-Party-2024-General-Election-Manifesto-Long-version_imprint.pdf'}
