<a href="https://colab.research.google.com/github/ferragina/MyInformationRetrieval/blob/main/6_LLMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Weaviate as a Search Engine

In [1]:
!pip install -U weaviate-client
import weaviate
import weaviate.classes.config as wc

Collecting weaviate-client
  Downloading weaviate_client-4.11.3-py3-none-any.whl.metadata (3.7 kB)
Collecting validators==0.34.0 (from weaviate-client)
  Downloading validators-0.34.0-py3-none-any.whl.metadata (3.8 kB)
Collecting authlib<1.3.2,>=1.2.1 (from weaviate-client)
  Downloading Authlib-1.3.1-py2.py3-none-any.whl.metadata (3.8 kB)
Collecting grpcio-tools<2.0.0,>=1.66.2 (from weaviate-client)
  Downloading grpcio_tools-1.71.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.3 kB)
Collecting grpcio-health-checking<2.0.0,>=1.66.2 (from weaviate-client)
  Downloading grpcio_health_checking-1.71.0-py3-none-any.whl.metadata (1.0 kB)
Downloading weaviate_client-4.11.3-py3-none-any.whl (353 kB)
Downloading validators-0.34.0-py3-none-any.whl (43 kB)

In [3]:
import weaviate
from weaviate.classes.query import MetadataQuery
from weaviate.classes.config import Configure, Property, DataType, Tokenization
from weaviate.classes.query import Filter

client = weaviate.connect_to_embedded()

INFO:weaviate-client:Started /root/.cache/weaviate-embedded: process ID 2771


Let's create a simple collection with a single text property.

In [78]:
client.collections.delete_all()
client.collections.create(
    name="TestCollection",
    properties=[
        wc.Property(name="text", data_type=wc.DataType.TEXT),
    ]
)

<weaviate.collections.collection.sync.Collection at 0x7dee565e0a90>

Here is a list of simple documents that are useful to test some simple queries

In [5]:
sample_docs = [
    {"text": "Trump u.s.a. NATO"},
    {"text": "trump usa N.A.T.O."},
    {"text": "trump u s a NATO"},
    {"text": "the cat sleeps"},
    {"text": "u are a star"}
]

The collection was created above; now we fetch it and insert the samples.

In [6]:
documents = client.collections.get("TestCollection")
for doc in sample_docs:
    documents.data.insert(doc)

Here is how to iterate over all documents in the collection

In [17]:
# retrieve all the elements of the collection
for doc in documents.iterator():
  print(doc.uuid, " - ", doc.properties["text"])

21786351-d612-4d2a-97de-18a45fcb813b  -  Trump u.s.a. NATO
3f438232-fff8-4f7f-908a-ea6618b5c003  -  u are a star
7cb41e29-7eea-4137-b04b-631463af267f  -  the cat sleeps
853cc7eb-6a48-46b6-920a-a6ac9a022a5a  -  trump u s a NATO
e25209be-3b53-4d21-a7b0-5ae67830194a  -  trump usa N.A.T.O.


Let's try some simple queries. BM25 is the keyword-based ranking function that we saw in lecture 2 (an improvement over TF-IDF). This means that the following query is matched lexically, on the words of the text, with no semantic vectors involved.
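To make the ranking function concrete, here is a minimal pure-Python sketch of plain BM25. It is not Weaviate's implementation: the parameter values `k1=1.2` and `b=0.75`, the IDF variant, and the hand-made token lists are assumptions chosen for illustration, so the absolute scores will not match the ones printed by Weaviate.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with plain BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # document frequency: number of documents in which each term occurs
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # longer-than-average documents are penalized through b
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hand-tokenized toy documents (an assumption, not Weaviate's tokenizer output)
docs = [
    ["trump", "u", "s", "a", "nato"],
    ["trump", "usa", "n", "a", "t", "o"],
    ["the", "cat", "sleeps"],
    ["u", "are", "a", "star"],
]
scores = bm25_scores(["u"], docs)
for doc, score in zip(docs, scores):
    print(round(score, 2), " ".join(doc))
```

Documents that do not contain the query term get a score of zero, and among the matching documents the shorter one ranks higher because of the length normalization.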

In [25]:
query = "u"
response = documents.query.bm25(query=query, return_metadata=MetadataQuery(score=True))
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["text"]))

0.26 - u are a star
0.24 - trump u s a NATO
0.24 - Trump u.s.a. NATO


Words are lowercased but, unfortunately, not stemmed. Stemming is on the roadmap of features that Weaviate plans to support in the future.

Let's also define a function that properly prints the results of a query

In [26]:
def print_query_results(query, prop_name, collection):
  print("QUERY:: {}\n".format(query))
  response = collection.query.bm25(query=query, return_metadata=MetadataQuery(score=True))
  for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties[prop_name]))

In [27]:
print_query_results("TRUMP", "text", documents) #the words are lowercased

QUERY:: TRUMP

0.24 - trump u s a NATO
0.24 - Trump u.s.a. NATO
0.22 - trump usa N.A.T.O.


In [28]:
print_query_results("Trump", "text", documents) #the words are lowercased

QUERY:: Trump

0.24 - trump u s a NATO
0.24 - Trump u.s.a. NATO
0.22 - trump usa N.A.T.O.


In [29]:
print_query_results("the", "text", documents) # stopwords are filtered out (English is assumed)

QUERY:: the



Now we define a function that runs some very basic queries.

In [30]:
def example_queries(prop_name, collection):
    queries = ["She is sleeping", "I sleep", "the usa", "I live in the u.s.a.", "TRUMP"]
    for query in queries:
      print_query_results(query, prop_name, collection)
      print("===============================================================")
      print()

In [31]:
print(sample_docs)
print("\n")
example_queries("text", documents)

[{'text': 'Trump u.s.a. NATO'}, {'text': 'trump usa N.A.T.O.'}, {'text': 'trump u s a NATO'}, {'text': 'the cat sleeps'}, {'text': 'u are a star'}]


QUERY:: She is sleeping


QUERY:: I sleep


QUERY:: the usa

0.56 - trump usa N.A.T.O.

QUERY:: I live in the u.s.a.

0.62 - trump u s a NATO
0.62 - Trump u.s.a. NATO
0.26 - u are a star

QUERY:: TRUMP

0.24 - trump u s a NATO
0.24 - Trump u.s.a. NATO
0.22 - trump usa N.A.T.O.



But how is the input really treated? How is it tokenized?

**TOKENIZATION OPTIONS**
* word: splits on any non-alphanumeric character and lowercases the tokens; stopwords are filtered (default tokenizer for Weaviate)
* lowercase: splits on whitespace and lowercases the tokens
* whitespace: splits on whitespace; tokens are case-sensitive
* field: the entire value of the property is treated as a single token
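The four options can be approximated in a few lines of pure Python. This is only a sketch of the behavior described in the list above: the function `tokenize` and the tiny `STOPWORDS` set are hypothetical helpers written for this example, not part of the Weaviate client, and Weaviate's actual stopword preset is larger.

```python
import re

# Small illustrative stopword list (an assumption, not Weaviate's actual preset)
STOPWORDS = {"a", "an", "the", "is", "are", "in"}

def tokenize(text, mode="word"):
    """Approximate the four tokenization options on a single string."""
    if mode == "word":
        # split on every non-alphanumeric character, lowercase, drop stopwords
        tokens = [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]
        return [t for t in tokens if t not in STOPWORDS]
    if mode == "lowercase":
        return text.lower().split()
    if mode == "whitespace":
        return text.split()
    if mode == "field":
        return [text.strip()]
    raise ValueError("unknown tokenization mode: {}".format(mode))

for mode in ["word", "lowercase", "whitespace", "field"]:
    print(mode, "->", tokenize("Trump u.s.a. NATO", mode))
```

Note how only the word tokenizer breaks "u.s.a." into single letters, which is why the query "u" matched "Trump u.s.a. NATO" in the first collection.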

In [79]:
client.collections.create(
    name="TestWhitespace",
    properties=[
        wc.Property(name="text", data_type=wc.DataType.TEXT, tokenization=Tokenization.WHITESPACE),
    ],
)

<weaviate.collections.collection.sync.Collection at 0x7dee63bdc5d0>

In [42]:
documents = client.collections.get("TestWhitespace")
for doc in sample_docs:
    documents.data.insert(doc)

In [None]:
print_query_results("the", "text", documents) # stopword is found

QUERY:: the

0.68 - the cat sleeps


In [39]:
print_query_results("Trump", "text", documents) # no lowercasing, so "trump" is not matched

QUERY:: Trump

0.68 - Trump u.s.a. NATO


In [40]:
print_query_results("trump", "text", documents) # no lowercasing, so "Trump" is not matched

QUERY:: trump

0.43 - trump usa N.A.T.O.
0.34 - trump u s a NATO


In [None]:
print_query_results("u", "text", documents) # whitespace does not split "u.s.a." which is one token

QUERY:: u

0.38 - u are a star
0.34 - trump u s a NATO


In [37]:
print_query_results("u.s.a.", "text", documents)

QUERY:: u.s.a.

0.68 - Trump u.s.a. NATO


In [34]:
example_queries("text", documents)

QUERY:: She is sleeping


QUERY:: I sleep


QUERY:: the usa

0.68 - trump usa N.A.T.O.
0.68 - the cat sleeps

QUERY:: I live in the u.s.a.

0.68 - Trump u.s.a. NATO
0.68 - the cat sleeps

QUERY:: TRUMP




In [43]:
client.collections.create(
    name="TestLowercase",
    properties=[
        wc.Property(name="text", data_type=wc.DataType.TEXT, tokenization=Tokenization.LOWERCASE),
    ],
)

documents = client.collections.get("TestLowercase")
for doc in sample_docs:
    documents.data.insert(doc)

example_queries("text", documents)

QUERY:: She is sleeping


QUERY:: I sleep


QUERY:: the usa

0.68 - the cat sleeps
0.68 - trump usa N.A.T.O.

QUERY:: I live in the u.s.a.

0.68 - Trump u.s.a. NATO
0.68 - the cat sleeps

QUERY:: TRUMP

0.26 - Trump u.s.a. NATO
0.26 - trump usa N.A.T.O.
0.21 - trump u s a NATO



## Properties
Let's now add some simple properties to our index. So far we have handled only the "text" property, which contained some short textual snippets.

In [80]:
!wget https://raw.githubusercontent.com/LorenzoBellomo/InformationRetrieval/refs/heads/main/data/5articles.json
import json

with open("5articles.json", 'r') as f:
  articles = json.load(f)

--2025-03-27 17:16:33--  https://raw.githubusercontent.com/LorenzoBellomo/InformationRetrieval/refs/heads/main/data/5articles.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12566 (12K) [text/plain]
Saving to: ‘5articles.json.3’


2025-03-27 17:16:33 (20.3 MB/s) - ‘5articles.json.3’ saved [12566/12566]



In [81]:
articles[0]

{'title': 'American Airlines orders 60 Overture supersonic jets',
 'maintext': "The revival of supersonic passenger travel, thought to be long dead with the demise of Concorde nearly two decades ago, could be about to take wing as American Airlines has put in an order for 60 aircraft capable of flying at 1.7 times the speed of sound. \nBoom is a start-up based in Denver, Colorado, whose development of Overture, an ultra-fast successor to Concorde that seats 65 to 88 passengers, is so advanced that it showed off designs at last month's Farnborough air show.",
 'date': '2022-08-18',
 'source': 'The New York Times'}

In [82]:
client.collections.create(
    name="TestProperties",
    properties=[
        wc.Property(name="maintext", data_type=wc.DataType.TEXT, tokenization=Tokenization.WORD),
        wc.Property(name="title", data_type=wc.DataType.TEXT, tokenization=Tokenization.LOWERCASE),
    ],
)

<weaviate.collections.collection.sync.Collection at 0x7dee565e3f10>

In [83]:
documents = client.collections.get("TestProperties")
for doc in articles:
    documents.data.insert({"maintext": doc["maintext"], "title": doc["title"]})

In [84]:
for doc in documents.iterator():
  print(doc.uuid, " - ", doc.properties)

46130432-de46-4593-ae06-3d30cfedf26b  -  {'maintext': 'Antonio Conte. Pic: PA\nHead coach Antonio Conte does not think Chelsea are in the race to sign Arsenal forward Alexis Sanchez.\nSanchez is out of contract this summer and seemed certain to join Manchester City this month.\nBut the Premier League leaders on Monday evening decided to end their interest because of the costs involved, with Manchester United in pole position, while there were suggestions the Premier League champions were also in the running.\nConte last Friday spoke of his admiration for Sanchez and described any potential cut-priced deal for the Chile striker as a great opportunity.\nThe Italian was evasive when quizzed on Chelsea\'s interest in the player, taking his usual stance in deferring matters of recruitment to the club.\nAsked if Chelsea were actively seeking to sign Sanchez, Conte said: "I don\'t know. I don\'t think so. I don\'t know, but I don\'t think so."\nConte, speaking ahead of tonight\'s FA Cup third

In [49]:
print_query_results("mother", "title", documents) # prints the score and the title of the retrieved article

QUERY:: mother

0.52 - 'One-punch killer's sentence will make others think twice'
0.3 - Leclerc dedicates win to Hubert


In [50]:
print_query_results("cars", "title", documents) # there is no stemming: an article mentioning only "car" is not returned

QUERY:: cars

0.48 - Leclerc dedicates win to Hubert


In [52]:
print_query_results("Leclerc", "title", documents) # The score can be larger than 1

QUERY:: Leclerc

2.04 - Leclerc dedicates win to Hubert


Say that you now want to treat as "stopwords" some words that the system does not consider as such by default.

In [53]:
print_query_results("victory", "title", documents) #As above, but below we classify it as a stopword

documents.config.update(inverted_index_config=wc.Reconfigure.inverted_index(stopwords_additions=["victory"]))

print("\n")
print_query_results("victory", "title", documents)

QUERY:: victory

0.71 - Leclerc dedicates win to Hubert


QUERY:: victory



**EXERCISE:** Load 500news.json and index the following properties:
*   maintext (using word tokenizer)
*   source (using lowercase tokenizer)
*   author (using field tokenizer)



In [57]:
client.collections.create(
    name="Test500news",
    properties=[
        wc.Property(name="maintext", data_type=wc.DataType.TEXT, tokenization=Tokenization.WORD),
        wc.Property(name="source", data_type=wc.DataType.TEXT, tokenization=Tokenization.LOWERCASE),
        wc.Property(name="author", data_type=wc.DataType.TEXT, tokenization=Tokenization.FIELD),
    ],
)

<weaviate.collections.collection.sync.Collection at 0x7dee56439710>

In [59]:
!wget https://raw.githubusercontent.com/LorenzoBellomo/InformationRetrieval/refs/heads/main/data/500news.json
import json

with open("500news.json", 'r') as f:
  art500 = json.load(f)

--2025-03-27 16:43:56--  https://raw.githubusercontent.com/LorenzoBellomo/InformationRetrieval/refs/heads/main/data/500news.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 147867 (144K) [text/plain]
Saving to: ‘500news.json’


2025-03-27 16:43:56 (4.15 MB/s) - ‘500news.json’ saved [147867/147867]



In [60]:
documents = client.collections.get("Test500news")
for doc in art500:
    documents.data.insert({
        "maintext": doc["maintext"],
        "source": doc["source"],
        "author": doc["author"]
    })

In [69]:
print_query_results("Melchiorre Paccioretti", "maintext", documents)

QUERY:: Melchiorre Paccioretti

1.21 - Today the Senate Budget Committee examines the Maneuver. Fico writes to Casellati: concern for exam times
1.21 - The traditional party for lighting up the lights
1.21 - After the controversy the executive corrects the norm on plastic. Medical devices and single-use plastic articles used to contain and protect medicinal preparations are excluded from the payment of the tax. Italia Viva: “Moves forward but we are not satisfied”
1.21 - Dmitry Obretetskiy, 49 years old, was walking the dog. The latest in a series of 'mysterious' deaths of Russian characters in Britain. Police investigation continues
1.21 - Russian President Vladimir Putin, in a joke torn by journalists, said he was “satisfied” with his face to face with Ukrainian counterpart Volodymyr Zelensky. Previewed a new meeting in four months
1.21 - Prime Minister received the award today but ethnic violence is not ending in Ethiopia
1.21 - Now the decree passes to the Senate, must be converted

In [74]:
# search on a specific property
response = documents.query.bm25(
    query="repubblica",
    query_properties=["maintext"], # this is the line to add
    return_metadata=MetadataQuery(score=True)
)
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["maintext"]))

2.34 - According to La Repubblica, Azzolina copied part of her thesis from specialized texts. “A minister like this has no right to give (and do) lessons” says the leader of Carroccio


In [75]:
response = documents.query.bm25(
    query="repubblica",
    return_metadata=MetadataQuery(score=True)
)
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["maintext"]))

2.28 - According to La Repubblica, Azzolina copied part of her thesis from specialized texts. “A minister like this has no right to give (and do) lessons” says the leader of Carroccio
0.94 - 
0.94 - According to French media, in the Paris region alone, the Ile de France, the queues together amount to 500-600 kilometers
0.94 - Among the causes of the flames, the malfunctioning of the electrical system that, would not have been changed since 1966.
0.94 - We are working on a solution at the OECD level but if we do not achieve a result there is a mandate for a European agreement to tax the giants of the web, said Commissioner
0.94 - Yesterday in Brussels it was decided to postpone until the beginning of 2020 the agreement on the reform of the ESM and the Minister of Economy, Roberto Gualtieri, said he was more confident to be able to bring home an agreement also in Italy
0.94 - 
0.94 - On the slopes of the volcano there were tourists
0.94 - The 23-year-old was at the train station to go to

But the fields searched by a query are not all "born equal": some are more important than others (e.g., the title). Let's boost the importance of the "title" field by giving it twice the weight in the score.

In [99]:
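documents = client.collections.get("TestProperties")  # back to the collection that has a "title" property
response = documents.query.bm25(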
    query="race",
    return_metadata=MetadataQuery(score=True)
)
print("BEFORE FIELD BOOSTING: (query = race)\n")
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))

BEFORE FIELD BOOSTING: (query = race)

1.27 - Conte: 'Chelsea are not in the race to sign Sanchez'
0.54 - Leclerc dedicates win to Hubert


In [100]:
response = documents.query.bm25(
    query="race",
    query_properties=["title^2", "maintext"],
    return_metadata=MetadataQuery(score=True)
)
print("AFTER FIELD BOOSTING: (query = race)\n")
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))

AFTER FIELD BOOSTING: (query = race)

1.43 - Conte: 'Chelsea are not in the race to sign Sanchez'
0.54 - Leclerc dedicates win to Hubert


The boosted score is not twice the original one, because:
- Weaviate does not use TF-IDF but BM25, which scales slightly differently
- "race" also appears in the maintext of the article, and that property is not boosted

In [101]:
response.objects[0].properties["maintext"].count("race") # indeed it appears once

1
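Both reasons can be seen in a simplified BM25F-style sketch, where the weighted term frequencies of all fields are added together before the usual BM25 saturation is applied. This is an illustration under loud assumptions: the function `bm25f_term_score` is hypothetical, length normalization is ignored, `idf` is fixed to 1, and Weaviate's real formula may differ in its details.

```python
def bm25f_term_score(tf_by_field, weights, idf=1.0, k1=1.2):
    """Simplified BM25F: combine weighted per-field term counts, then saturate."""
    tf = sum(weights.get(field, 1.0) * count for field, count in tf_by_field.items())
    return idf * tf * (k1 + 1) / (tf + k1)

# "race" occurs once in the title and once in the maintext
counts = {"title": 1, "maintext": 1}
before = bm25f_term_score(counts, {"title": 1.0, "maintext": 1.0})
after = bm25f_term_score(counts, {"title": 2.0, "maintext": 1.0})
print(round(before, 3), round(after, 3))

# "race" occurs only in the maintext: boosting the title changes nothing
only_main = {"maintext": 1}
unchanged = bm25f_term_score(only_main, {"title": 2.0, "maintext": 1.0})
```

In this toy model the boosted score grows, but by far less than a factor of two, and a document matching only in the unboosted field keeps its score, which is the same qualitative pattern as in the two outputs above.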

Add some basic filtering

In [102]:
response = documents.query.bm25(
    query="race",
    return_metadata=MetadataQuery(score=True)
)
print("BEFORE FILTERING: (query = race)\n")
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))

BEFORE FILTERING: (query = race)

1.27 - Conte: 'Chelsea are not in the race to sign Sanchez'
0.54 - Leclerc dedicates win to Hubert


In [106]:
response = documents.query.bm25(
    query="race",
    filters=Filter.by_property("title").contains_any(["Leclerc", "formula"]),
    return_metadata=MetadataQuery(score=True)
)
print("AFTER FILTERING: (query = race)\n")
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))

AFTER FILTERING: (query = race)

0.54 - Leclerc dedicates win to Hubert


Let's see what happens when we also add dates as properties

In [108]:
# drop any leftover copy from a previous run, otherwise create() fails with a 422 "already exists" error
client.collections.delete("TestDate")
client.collections.create(
    name="TestDate",
    properties=[
        wc.Property(name="maintext", data_type=wc.DataType.TEXT, tokenization=Tokenization.WORD),
        wc.Property(name="title", data_type=wc.DataType.TEXT, tokenization=Tokenization.LOWERCASE),
        wc.Property(name="date", data_type=wc.DataType.DATE)
    ]
)

[All property types](https://weaviate.io/developers/weaviate/config-refs/datatypes)

In [110]:
from datetime import timezone, datetime
documents = client.collections.get("TestDate")
for doc in articles:
    documents.data.insert({
        "maintext": doc["maintext"],
        "title": doc["title"],
        "date": datetime.strptime(doc["date"], "%Y-%m-%d").replace(tzinfo=timezone.utc)
    })

In [111]:
for doc in documents.iterator():
  print(doc.uuid, " - ", doc.properties['date'], '  ', doc.properties['title'])
  # print(doc.uuid, " - ", doc.properties)

0c987ac2-adc3-4ca2-ae5f-4e775220e663  -  2019-06-07 00:00:00+00:00    Gunman opens fire on car just metres from scene of Hamid Sanambar murder
4af52286-21b7-4e9c-ae05-12f679e5dedb  -  2022-08-18 00:00:00+00:00    American Airlines orders 60 Overture supersonic jets
54495fad-8775-49e7-984d-bedaa9d25bee  -  2019-09-01 00:00:00+00:00    Leclerc dedicates win to Hubert
70aa9392-c5fb-4749-bcd3-d93638af9343  -  2018-01-23 00:00:00+00:00    Conte: 'Chelsea are not in the race to sign Sanchez'
eed3442f-b9d1-4c16-ba6a-9ed3e9162ad6  -  2019-06-29 00:00:00+00:00    'One-punch killer's sentence will make others think twice'


In [112]:
response = documents.query.bm25(
    query="race",
    return_metadata=MetadataQuery(score=True)
)
print("BEFORE FILTERING: (query = race)\n")
for o in response.objects:
    print("{} - {} ({})".format(round(o.metadata.score*100)/100, o.properties["title"], o.properties["date"]))

BEFORE FILTERING: (query = race)

1.27 - Conte: 'Chelsea are not in the race to sign Sanchez' (2018-01-23 00:00:00+00:00)
0.54 - Leclerc dedicates win to Hubert (2019-09-01 00:00:00+00:00)


In [113]:
reference_date = datetime.strptime("2019-08-15", "%Y-%m-%d").replace(tzinfo=timezone.utc)
response = documents.query.bm25(
    query="race",
    filters=Filter.by_property("date").greater_or_equal(reference_date),
    return_metadata=MetadataQuery(score=True)
)
print("AFTER FILTERING: (query = race)\n")
for o in response.objects:
    print("{} - {} ({})".format(round(o.metadata.score*100)/100, o.properties["title"], o.properties["date"]))

AFTER FILTERING: (query = race)

0.54 - Leclerc dedicates win to Hubert (2019-09-01 00:00:00+00:00)
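The date handling used above can be tried out without a database. The sketch below reuses the two dates printed in the previous outputs; the helper `to_utc` is a hypothetical name introduced here. Attaching a timezone matters on the Python side too, because comparing a timezone-aware datetime with a naive one raises a `TypeError`.

```python
from datetime import datetime, timezone

def to_utc(date_string):
    """Parse a YYYY-MM-DD string into a timezone-aware datetime (midnight UTC)."""
    return datetime.strptime(date_string, "%Y-%m-%d").replace(tzinfo=timezone.utc)

dated_titles = {
    "Conte: 'Chelsea are not in the race to sign Sanchez'": "2018-01-23",
    "Leclerc dedicates win to Hubert": "2019-09-01",
}

reference_date = to_utc("2019-08-15")
# client-side equivalent of a "greater or equal" filter on the date
kept = [title for title, day in dated_titles.items() if to_utc(day) >= reference_date]
print(kept)
print(reference_date.isoformat())
```

The difference with the Weaviate filter is that here the comparison happens after retrieval, whereas `filters=` restricts the candidate set inside the search engine.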


Now for some advanced features: let's try vector (semantic) queries. Assume we want to find all the articles that are "related to sport". In the current collection, the word "sport" does not appear in any title or maintext.

In [114]:
response = documents.query.bm25(query="sport", return_metadata=MetadataQuery(score=True))
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))

Let's set up a text vectorizer to run some semantic search queries.

In [115]:
# Unfortunately, we cannot use all the vectorizer modules that are present in Weaviate. Here is a list of the ones that are available
client.get_meta()

{'hostname': 'http://127.0.0.1:8079',
 'modules': {'generative-openai': {'documentationHref': 'https://platform.openai.com/docs/api-reference/completions',
   'name': 'Generative Search - OpenAI'},
  'qna-openai': {'documentationHref': 'https://platform.openai.com/docs/api-reference/completions',
   'name': 'OpenAI Question & Answering Module'},
  'ref2vec-centroid': {},
  'reranker-cohere': {'documentationHref': 'https://txt.cohere.com/rerank/',
   'name': 'Reranker - Cohere'},
  'text2vec-cohere': {'documentationHref': 'https://docs.cohere.ai/embedding-wiki/',
   'name': 'Cohere Module'},
  'text2vec-huggingface': {'documentationHref': 'https://huggingface.co/docs/api-inference/detailed_parameters#feature-extraction-task',
   'name': 'Hugging Face Module'},
  'text2vec-openai': {'documentationHref': 'https://platform.openai.com/docs/guides/embeddings/what-are-embeddings',
   'name': 'OpenAI Module'}},
 'version': '1.26.6'}

Let's use Cohere as a text vectorizer (API keys: [https://dashboard.cohere.com/api-keys](https://dashboard.cohere.com/api-keys)). As we can see, on Colab we have only a few options for vectorization (OpenAI, Cohere, Hugging Face). Additionally, only one generation module is available (OpenAI).
Cohere provides free trial API keys. OpenAI does not.

In [116]:
## You need first to create a KEY !!!!
from google.colab import userdata

client.close()
cohere_key = userdata.get('COHERE_KEY') # MAKE SURE YOU CREATED A KEY
headers = {
    "X-Cohere-Api-Key": cohere_key,
}
client = weaviate.connect_to_embedded(headers=headers)

INFO:weaviate-client:Started /root/.cache/weaviate-embedded: process ID 36523


See here how to integrate models: [https://weaviate.io/developers/weaviate/model-providers](https://weaviate.io/developers/weaviate/model-providers)

Now we create the example collection. Note that here we set only the vectorizer (Cohere); the generation module (OpenAI, only available with a paid account) is added later, for an experiment on generation.

In [117]:
client.collections.delete_all()
client.collections.create(
    name="TestVectorizer",
    properties=[
        wc.Property(name="maintext", data_type=wc.DataType.TEXT, tokenization=Tokenization.WORD),
        wc.Property(name="title", data_type=wc.DataType.TEXT, tokenization=Tokenization.LOWERCASE),
    ],
    vectorizer_config=[
        Configure.NamedVectors.text2vec_cohere(
            name="maintext_vector",
            source_properties=["maintext"],
            #model="embed-multilingual-light-v3.0"
        )
    ]
)

<weaviate.collections.collection.sync.Collection at 0x7dee63ad54d0>

In [118]:
documents = client.collections.get("TestVectorizer")
for doc in articles:
    documents.data.insert({"maintext": doc["maintext"], "title": doc["title"]}) # here weaviate performs the vectorization

In [119]:
print("pure syntactical search (ordered by decreasing similarity score): 'sport'\n")
response = documents.query.bm25(query="sport", return_metadata=MetadataQuery(score=True))
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))

pure syntactical search (ordered by decreasing similarity score): 'sport'



In [120]:
print("pure vector search (ordered by increasing distance): 'sport'\n")
# NOTE THAT WE ALSO NEED THE PARAMETER DISTANCE
response = documents.query.near_text(query="sport", return_metadata=MetadataQuery(score=True, distance=True), limit=3)
for o in response.objects:
  print("{} - {} (score is {})".format(round(o.metadata.distance*100)/100, o.properties["title"], round(o.metadata.score*100)/100))

pure vector search (ordered by increasing distance): 'sport'

0.6 - Leclerc dedicates win to Hubert (score is 0.0)
0.61 - Gunman opens fire on car just metres from scene of Hamid Sanambar murder (score is 0.0)
0.65 - Conte: 'Chelsea are not in the race to sign Sanchez' (score is 0.0)


In [136]:
print("pure syntactical search (ordered by decreasing similarity score): 'race'\n")
response = documents.query.bm25(query="race", return_metadata=MetadataQuery(score=True))
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))

pure syntactical search (ordered by decreasing similarity score): 'race'

1.27 - Conte: 'Chelsea are not in the race to sign Sanchez'
0.54 - Leclerc dedicates win to Hubert


In [122]:
print("pure vector search (ordered by increasing distance): 'race'\n")
# NOTE THAT WE ALSO NEED THE PARAMETER DISTANCE
response = documents.query.near_text(query="race", return_metadata=MetadataQuery(score=True, distance=True), limit=3)
for o in response.objects:
  print("{} - {} (score is {})".format(round(o.metadata.distance*100)/100, o.properties["title"], round(o.metadata.score*100)/100))

pure vector search (ordered by increasing distance): 'race'

0.6 - Leclerc dedicates win to Hubert (score is 0.0)
0.61 - Gunman opens fire on car just metres from scene of Hamid Sanambar murder (score is 0.0)
0.69 - Conte: 'Chelsea are not in the race to sign Sanchez' (score is 0.0)
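The "distance" printed above measures how far the query vector is from each document vector. Assuming the cosine metric (believed to be Weaviate's default, but treat this as an assumption), the distance is one minus the cosine similarity. The sketch below uses made-up two-dimensional vectors; real embeddings have hundreds of dimensions and come from the vectorizer.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity: 0 for same direction, 1 for orthogonal vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy vectors (hypothetical values chosen for illustration)
query_vec = [1.0, 0.0]
doc_vecs = {"close": [0.9, 0.1], "far": [0.0, 1.0]}

# vector search returns documents by increasing distance
ranked = sorted(doc_vecs, key=lambda name: cosine_distance(query_vec, doc_vecs[name]))
print(ranked)
```

Since no keyword matching is involved, a document can be returned even if it shares no word with the query, as happened with "sport".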


In [124]:
print("hybrid search (ordered by decreasing score): 'race'")
response = documents.query.hybrid(query="race", alpha=0.5, return_metadata=MetadataQuery(score=True, explain_score=True), limit=3)
for o in response.objects:
  print("{} - {} [{}]".format(round(o.metadata.score*100)/100, o.properties["title"],  o.metadata.explain_score.strip().replace("\n", '')))

hybrid search (ordered by decreasing score): 'race'
0.6 - Conte: 'Chelsea are not in the race to sign Sanchez' [Hybrid (Result Set keyword,bm25) Document 6678b75b-3104-45e3-a8d8-302a627c57bf: original score 1.2714014, normalized score: 0.5 - Hybrid (Result Set vector,hybridVector) Document 6678b75b-3104-45e3-a8d8-302a627c57bf: original score 0.31254613, normalized score: 0.10011594]
0.5 - Leclerc dedicates win to Hubert [Hybrid (Result Set keyword,bm25) Document c396d041-4307-48f7-ba71-f1bf2bb2c29f: original score 0.5364737, normalized score: 0 - Hybrid (Result Set vector,hybridVector) Document c396d041-4307-48f7-ba71-f1bf2bb2c29f: original score 0.39786047, normalized score: 0.5]
0.46 - Gunman opens fire on car just metres from scene of Hamid Sanambar murder [Hybrid (Result Set vector,hybridVector) Document 3a8c833e-6787-4091-ba84-604b4247f01f: original score 0.38975292, normalized score: 0.46199843]


[Description of how scoring works](https://weaviate.io/developers/weaviate/concepts/search/hybrid-search)
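The numbers in the `explain_score` output above are consistent with a min-max normalization of each result set, followed by a weighted sum controlled by `alpha`. The sketch below reproduces the three final scores from the printed original scores. One value is an assumption: the entry `"other"` with score 0.2912 stands for a lower-ranked vector result that is not shown in the output, and it was back-computed from the printed normalized scores. The function names are hypothetical.

```python
def min_max(scores):
    """Rescale a result set's scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def relative_score_fusion(keyword_scores, vector_scores, alpha=0.5):
    """Weighted sum of normalized scores: alpha for vector, 1 - alpha for keyword."""
    kw, vec = min_max(keyword_scores), min_max(vector_scores)
    return {
        doc: (1 - alpha) * kw.get(doc, 0.0) + alpha * vec.get(doc, 0.0)
        for doc in set(kw) | set(vec)
    }

# Original scores copied from the explain_score output above
keyword_scores = {"Conte": 1.2714014, "Leclerc": 0.5364737}
vector_scores = {
    "Conte": 0.31254613,
    "Leclerc": 0.39786047,
    "Gunman": 0.38975292,
    "other": 0.2912,  # assumed: lowest vector score, not shown in the output
}

fused = relative_score_fusion(keyword_scores, vector_scores, alpha=0.5)
for doc in sorted(fused, key=fused.get, reverse=True):
    print(round(fused[doc], 2), doc)
```

With `alpha=0.5` the best keyword match and the best vector match each contribute at most 0.5, which is why the "normalized score" fields above never exceed 0.5.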

## A new method, RAG
RAG stands for Retrieval-Augmented Generation. It is a recent trend in Information Retrieval that aims to reduce the "hallucinations" of Large Language Models and to return better answers grounded in local document archives.
- Traditional queries: the user sends a query to a search engine; the search engine returns the answer to that query in some predefined format.
- LLM queries: the user sends a query to a Large Language Model (LLM); the LLM creates an answer based on the (often unspecified) data it was trained on. The LLM often hallucinates, returning wrong answers.
- RAG: the user sends a query to a search engine, which runs it and collects the results. Before being returned to the user, the results are sent to an LLM that turns them into a textual response that is more convenient to read and (ideally) free of hallucinated information, because it relies on the retrieved results.
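The RAG flow described in the last bullet can be sketched in a few lines. Everything here is a placeholder written for illustration: `retrieve` is a toy word-overlap ranker standing in for the search engine, and `fake_llm` stands in for a real LLM call, so no API key is needed to run it.

```python
def retrieve(query, corpus, limit=2):
    """Toy retriever: rank documents by how many query words they contain."""
    words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:limit]

def build_prompt(question, passages):
    """Put the retrieved passages in the prompt so the answer is grounded in them."""
    context = "\n".join("- " + p for p in passages)
    return "Answer using only the context below.\n\nContext:\n{}\n\nQuestion: {}".format(
        context, question
    )

def fake_llm(prompt):
    """Placeholder for a real LLM call (hypothetical; returns a canned string)."""
    return "LLM answer grounded in a prompt of {} characters".format(len(prompt))

def rag_answer(question, corpus, llm, limit=2):
    passages = retrieve(question, corpus, limit)   # 1. retrieval
    prompt = build_prompt(question, passages)      # 2. augmentation
    return llm(prompt)                             # 3. generation

corpus = [
    "American Airlines orders 60 Overture supersonic jets",
    "Leclerc dedicates win to Hubert",
    "Conte: 'Chelsea are not in the race to sign Sanchez'",
]
print(rag_answer("Leclerc win", corpus, fake_llm, limit=1))
```

Swapping `retrieve` for a Weaviate query and `fake_llm` for a real model gives the two RAG variants discussed in this section.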

https://weaviate.io/developers/weaviate/model-providers

Now let's add some generative AI prompts to this query (for instance, to add context about the entities in the news, or to translate the news into Italian).
Note that this query only works for those who have a paid OpenAI API key.

In [125]:
client.close()
cohere_key = userdata.get('COHERE_KEY')
openai_key = userdata.get("OPENAI_KEY2")
headers = {
    "X-Cohere-Api-Key": cohere_key,
    "X-OpenAI-Api-Key": openai_key
}
client = weaviate.connect_to_embedded(headers=headers)

INFO:weaviate-client:Started /root/.cache/weaviate-embedded: process ID 41689


In [126]:
client.collections.delete_all()
client.collections.create(
    name="TestVectorizer",
    properties=[
        wc.Property(name="maintext", data_type=wc.DataType.TEXT, tokenization=Tokenization.WORD),
        wc.Property(name="title", data_type=wc.DataType.TEXT, tokenization=Tokenization.LOWERCASE),
    ],
    vectorizer_config=[
        Configure.NamedVectors.text2vec_cohere(
            name="maintext_vector",
            source_properties=["maintext"],
            #model="embed-multilingual-light-v3.0"
        )
    ],
    generative_config=Configure.Generative.openai(model="gpt-4") # added generation module
)

<weaviate.collections.collection.sync.Collection at 0x7dee56429d50>

In [127]:
documents = client.collections.get("TestVectorizer")
for doc in articles:
    documents.data.insert({"maintext": doc["maintext"], "title": doc["title"]}) # here weaviate performs the vectorization

In [128]:
response = documents.generate.near_text(
    query="sport",  # The model provider integration will automatically vectorize the query
    single_prompt="Write a short summary of maximum 100 characters in Italian of {maintext}",
    limit=2 # apply LLM to the top 2 results
)

In [129]:
for obj in response.objects:
    print(obj.properties["title"])
    print(f"Generated output: {obj.generated}")  # Note that the generated output is per object
    print("====================================================")
    print()

Leclerc dedicates win to Hubert
Generated output: Charles Leclerc ha ottenuto la sua prima vittoria in Formula Uno al Gran Premio del Belgio, dedicandola ad Anthoine Hubert.

Gunman opens fire on car just metres from scene of Hamid Sanambar murder
Generated output: La polizia cerca un uomo armato che ha sparato su un'auto a Dublino, vicino al luogo dove Hamid Sanambar è stato ucciso.



The code above implements RAG using an external LLM (OpenAI), invoked through the built-in Weaviate module. We can also implement RAG by calling the LLM ourselves: Cohere handles both the vectorization (inside Weaviate) and the generation (directly through an API call). This way we do not need to pay for an OpenAI API key.

In [130]:
!pip install cohere

In [131]:
print("hybrid search: 'race'")
# alpha=0.5 gives equal weight to the keyword (BM25) and the vector scores
response = documents.query.hybrid(
    query="race", alpha=0.5, limit=3,
    return_metadata=MetadataQuery(score=True, explain_score=True),
)
for o in response.objects:
    print("{} - {} [{}]".format(round(o.metadata.score, 2), o.properties["title"], o.metadata.explain_score.strip().replace("\n", "")))

hybrid search: 'race'
0.6 - Conte: 'Chelsea are not in the race to sign Sanchez' [Hybrid (Result Set keyword,bm25) Document 96804a56-e56d-4959-89a6-94a865a8335d: original score 1.2714014, normalized score: 0.5 - Hybrid (Result Set vector,hybridVector) Document 96804a56-e56d-4959-89a6-94a865a8335d: original score 0.31205738, normalized score: 0.09897505]
0.5 - Leclerc dedicates win to Hubert [Hybrid (Result Set keyword,bm25) Document 76486655-ce5a-4aee-9325-d12e7231804a: original score 0.5364737, normalized score: 0 - Hybrid (Result Set vector,hybridVector) Document 76486655-ce5a-4aee-9325-d12e7231804a: original score 0.39708614, normalized score: 0.5]
0.46 - Gunman opens fire on car just metres from scene of Hamid Sanambar murder [Hybrid (Result Set vector,hybridVector) Document 1331788c-20cd-4e93-a320-d643a7dbfb60: original score 0.3893391, normalized score: 0.46346223]
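
The `explain_score` strings above are consistent with a relative score fusion: within each result set (keyword and vector) the original scores are min-max normalized to [0, 1], the keyword part is weighted by `1 - alpha`, the vector part by `alpha`, and the two contributions are summed. The sketch below reproduces that arithmetic; it is an illustration, not Weaviate's actual code. The keyword scores and the first three vector scores are copied from the output above, while the fourth vector entry (`other`, 0.2911) is an assumed value, since the lowest vector score of the candidate set is not shown.

```python
def min_max(scores):
    """Rescale a {doc: score} mapping to the [0, 1] range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:  # all scores equal: avoid a division by zero
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def relative_score_fusion(keyword_scores, vector_scores, alpha):
    """Weighted sum of the normalized keyword and vector scores."""
    kw = min_max(keyword_scores)
    vec = min_max(vector_scores)
    return {
        doc: (1 - alpha) * kw.get(doc, 0.0) + alpha * vec.get(doc, 0.0)
        for doc in set(kw) | set(vec)
    }

keyword_scores = {"conte": 1.2714014, "leclerc": 0.5364737}
vector_scores = {
    "leclerc": 0.39708614,
    "gunman": 0.3893391,
    "conte": 0.31205738,
    "other": 0.2911,  # assumed minimum, not present in the output above
}

fused = relative_score_fusion(keyword_scores, vector_scores, alpha=0.5)
ranking = sorted(fused, key=fused.get, reverse=True)
for doc in ranking:
    print(round(fused[doc], 2), doc)
```

A document that appears in only one result set (here `gunman`, which has no keyword match) simply receives no contribution from the other set.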


In [134]:
import cohere

co = cohere.ClientV2(api_key=cohere_key)
res = co.chat(
    model="command-r-plus-08-2024", # this is a cohere model
    messages=[
        {
            "role": "user",
            # response still holds the hits of the previous hybrid query; objects[1] is the second one
            "content": (
                "Write a short summary (at most 100 characters), in Italian, of the article "
                "provided below:\n\n{}"
            ).format(response.objects[1].properties["maintext"]),
        }
    ],
)

print(res.message.content[0].text)

Charles Leclerc vince il Gran Premio del Belgio, dedicando la vittoria ad Anthoine Hubert, suo amico scomparso.
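
A character limit stated in a prompt is only a soft constraint: the summary above is longer than 100 characters. If a hard limit matters, it can be checked and enforced after generation. A minimal sketch follows, where `truncate_at_word` is a hypothetical helper (not part of the Cohere SDK) that cuts the text at a word boundary.

```python
def truncate_at_word(text, max_chars, ellipsis="..."):
    """Return `text` unchanged if it fits, else cut it at a word boundary."""
    if len(text) <= max_chars:
        return text
    cut = text[: max_chars - len(ellipsis)]
    if " " in cut:
        cut = cut.rsplit(" ", 1)[0]  # drop the last, possibly partial, word
    return cut.rstrip(" ,;:") + ellipsis

# The summary returned by the model in the cell above
summary = ("Charles Leclerc vince il Gran Premio del Belgio, dedicando la "
           "vittoria ad Anthoine Hubert, suo amico scomparso.")

print(len(summary) > 100)           # the soft limit was not respected
short = truncate_at_word(summary, 100)
print(short)
```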


**EXERCISE**:
Run a purely keyword-based (BM25) query for "race", then produce a one-sentence summary in French of the maintext of the top-ranked article.

In [137]:
print("pure syntactical search (ordered by decreasing similarity score): 'race'\n")
response = documents.query.bm25(query="race", return_metadata=MetadataQuery(score=True), limit=1)
for o in response.objects:
    print("{} - {}".format(round(o.metadata.score*100)/100, o.properties["title"]))

pure syntactical search (ordered by decreasing similarity score): 'race'

1.27 - Conte: 'Chelsea are not in the race to sign Sanchez'


In [138]:
res = co.chat(
    model="command-r-plus-08-2024", # this is a cohere model
    messages=[
        {
            "role": "user",
            # response now holds the single hit of the BM25 query above
            "content": (
                "Write a one-sentence summary in French of the article "
                "provided below:\n\n{}"
            ).format(response.objects[0].properties["maintext"]),
        }
    ],
)

print(res.message.content[0].text)

Antonio Conte, l'entraîneur de Chelsea, a déclaré qu'il ne pensait pas que son club était en lice pour signer Alexis Sanchez, l'attaquant d'Arsenal, et a évité de discuter du marché des transferts, préférant laisser ces questions au club.
