# Vector Database Workshop with **Weaviate**

### Install Weaviate Python Client Library
The latest Weaviate Python client library can be installed using pip. The client library is tested on Python 3.8 and later. Install it, together with the other packages this workshop uses, with the following command:

In [19]:
!pip install weaviate-client tqdm python-dotenv kagglehub





In [20]:
from dotenv import load_dotenv
import os
import pandas as pd
import requests
from datetime import datetime, timezone
import json
from tqdm import tqdm

import weaviate
import weaviate.classes.config as wc
import weaviate.classes.query as wq
from weaviate.classes.init import Auth
from weaviate.util import generate_uuid5

### Create WCD Account (if you don't have one!!!)
To connect to the Weaviate Cloud (WCD) instance, you need to use the **cluster URL** and the **API key**. You can find these details in the WCD Console.

In [21]:
# Load environment variables from .env
load_dotenv()

# Access your variables
CLUSTER_URL = os.getenv('CLUSTER_URL')
WEAVIATE_API = os.getenv('WEAVIATE_API')
OPENAI_API = os.getenv('OPENAI_API')

print(CLUSTER_URL)  # Confirm that the cluster URL was loaded

https://lmphqcwrtfopennuxlqd8g.c0.us-west3.gcp.weaviate.cloud
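
A missing entry in `.env` only shows up later as a confusing connection or authentication error, so it can help to check the variables first. The following is a minimal sketch of such a check; `missing_env_vars` is a hypothetical helper, not part of `python-dotenv` or the Weaviate client, and the variable names are the ones this notebook reads.

```python
import os
from typing import List, Mapping, Sequence

# Variable names this notebook expects to find in the environment
REQUIRED_VARS = ("CLUSTER_URL", "WEAVIATE_API", "OPENAI_API")


def missing_env_vars(
    env: Mapping[str, str], required: Sequence[str] = REQUIRED_VARS
) -> List[str]:
    """Return the required variable names that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


# Check the real environment and report anything that is missing
missing = missing_env_vars(os.environ)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```

Passing the environment as an argument keeps the helper easy to try with a plain dictionary.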


Use the **connect_to_weaviate_cloud** function to connect to your WCD instance.

In [22]:
client = weaviate.connect_to_weaviate_cloud(
    cluster_url=CLUSTER_URL,  # Replace with your WCD URL
    auth_credentials=Auth.api_key(
        WEAVIATE_API # Replace with your WCD key
    ),
    headers={"X-OpenAI-Api-Key": OPENAI_API},  # OpenAI key used by the vectorizer and generative modules
)

Check Weaviate status

In [23]:
assert client.is_live()  # This will raise an exception if the client is not live

In [24]:
import json

metainfo = client.get_meta()
print(json.dumps(metainfo, indent=2))  # Print the meta information in a readable format

{
  "hostname": "http://[::]:8080",
  "modules": {
    "backup-gcs": {
      "bucketName": "weaviate-wcs-prod-cust-us-west3-workloads-backups",
      "rootName": "2cca61a9-cc11-4c53-a978-d9d45cba83f2"
    },
    "generative-anthropic": {
      "documentationHref": "https://docs.anthropic.com/en/api/getting-started",
      "name": "Generative Search - Anthropic"
    },
    "generative-anyscale": {
      "documentationHref": "https://docs.anyscale.com/endpoints/overview",
      "name": "Generative Search - Anyscale"
    },
    "generative-aws": {
      "documentationHref": "https://docs.aws.amazon.com/bedrock/latest/APIReference/welcome.html",
      "name": "Generative Search - AWS"
    },
    "generative-cohere": {
      "documentationHref": "https://docs.cohere.com/reference/chat",
      "name": "Generative Search - Cohere"
    },
    "generative-databricks": {
      "documentationHref": "https://docs.databricks.com/en/machine-learning/foundation-models/api-reference.html#completion-ta

### Source Data 

We are going to use a movie dataset sourced from [TMDB](https://www.themoviedb.org/). The dataset is available as a [JSON file on GitHub](https://raw.githubusercontent.com/weaviate-tutorials/edu-datasets/main/movies_data_1990_2024.json), and it contains metadata (title, overview, genres, release date, ratings) on ~700 movies released between 1990 and 2024.

In [25]:
import pandas as pd

data_url = "https://raw.githubusercontent.com/weaviate-tutorials/edu-datasets/main/movies_data_1990_2024.json"
resp = requests.get(data_url)
df = pd.DataFrame(resp.json())
df.head()

Unnamed: 0,backdrop_path,genre_ids,id,original_language,original_title,overview,popularity,poster_path,release_date,title,video,vote_average,vote_count
0,/3Nn5BOM1EVw1IYrv6MsbOS6N1Ol.jpg,"[14, 18, 10749]",162,en,Edward Scissorhands,A small suburban town receives a visit from a ...,45.694,/1RFIbuW9Z3eN9Oxw2KaQG5DfLmD.jpg,1990-12-07,Edward Scissorhands,False,7.7,12305
1,/sw7mordbZxgITU877yTpZCud90M.jpg,"[18, 80]",769,en,GoodFellas,"The true story of Henry Hill, a half-Irish, ha...",57.228,/aKuFiU82s5ISJpGZp7YkIr3kCUd.jpg,1990-09-12,GoodFellas,False,8.5,12106
2,/6uLhSLXzB1ooJ3522ydrBZ2Hh0W.jpg,"[35, 10751]",771,en,Home Alone,Eight-year-old Kevin McCallister makes the mos...,3.538,/onTSipZ8R3bliBdKfPtsDuHTdlL.jpg,1990-11-16,Home Alone,False,7.4,10599
3,/vKp3NvqBkcjHkCHSGi6EbcP7g4J.jpg,"[12, 35, 878]",196,en,Back to the Future Part III,The final installment of the Back to the Futur...,28.896,/crzoVQnMzIrRfHtQw0tLBirNfVg.jpg,1990-05-25,Back to the Future Part III,False,7.5,9918
4,/3tuWpnCTe14zZZPt6sI1W9ByOXx.jpg,"[35, 10749]",114,en,Pretty Woman,When a millionaire wheeler-dealer enters a bus...,97.953,/hVHUfT801LQATGd26VPzhorIYza.jpg,1990-03-23,Pretty Woman,False,7.5,7671


In [26]:
df.shape

(680, 13)

### Create a collection
Weaviate stores data in "collections". A collection is a set of objects that share the same data structure. In our movie database, we might have a collection of movies, a collection of actors, and a collection of reviews.

In [27]:
colname = "Movie"
if not client.collections.exists(colname):
    client.collections.create(
        name=colname,
        properties=[
            wc.Property(name="title", data_type=wc.DataType.TEXT),
            wc.Property(name="overview", data_type=wc.DataType.TEXT),
            wc.Property(name="vote_average", data_type=wc.DataType.NUMBER),
            wc.Property(name="genre_ids", data_type=wc.DataType.INT_ARRAY),
            wc.Property(name="release_date", data_type=wc.DataType.DATE),
            wc.Property(name="tmdb_id", data_type=wc.DataType.INT),
        ],
        # Define the vectorizer module
        vectorizer_config=wc.Configure.Vectorizer.text2vec_openai(),
        # Define the generative module
        generative_config=wc.Configure.Generative.openai()
    )

    # client.close()

### Populate the collection

This example imports the movie data into our collection.

* Loads the source data & gets the collection
* Enters a context manager with a batcher (batch) object
* Loops through the data and adds objects to the batcher
* Prints out any import errors
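
Each row needs two type conversions before it is added to the batcher: the release date string becomes a timezone-aware `datetime`, and the genre list is parsed into integers. A minimal sketch of that step as a standalone function follows; `to_movie_properties` is a hypothetical name, not part of the Weaviate client, and the sample row is copied from the first row of the dataset preview above.

```python
import json
from datetime import datetime, timezone


def to_movie_properties(row: dict) -> dict:
    """Convert one raw movie record into the property dict used for import."""
    # Parse the date string and attach UTC as the time zone
    release_date = datetime.strptime(row["release_date"], "%Y-%m-%d").replace(
        tzinfo=timezone.utc
    )
    # The genre ids may arrive as a JSON string such as "[14, 18, 10749]"
    genre_ids = row["genre_ids"]
    if isinstance(genre_ids, str):
        genre_ids = json.loads(genre_ids)
    return {
        "title": row["title"],
        "overview": row["overview"],
        "vote_average": row["vote_average"],
        "genre_ids": genre_ids,
        "release_date": release_date,
        "tmdb_id": row["id"],
    }


sample_row = {
    "title": "Edward Scissorhands",
    "overview": "A small suburban town receives a visit from a ...",
    "vote_average": 7.7,
    "genre_ids": "[14, 18, 10749]",
    "release_date": "1990-12-07",
    "id": 162,
}
props = to_movie_properties(sample_row)
print(props["release_date"].isoformat(), props["genre_ids"])
```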

In [28]:
# Get the collection
movies = client.collections.get("Movie")

# Enter context manager
with movies.batch.dynamic() as batch:
    # Loop through the data
    for i, movie in tqdm(df.iterrows()):
        # Convert data types
        # Convert a JSON date to `datetime` and add time zone information
        release_date = datetime.strptime(movie["release_date"], "%Y-%m-%d").replace(
            tzinfo=timezone.utc
        )
        # Convert a JSON array to a list of integers
        genre_ids = json.loads(movie["genre_ids"])

        # Build the object payload
        movie_obj = {
            "title": movie["title"],
            "overview": movie["overview"],
            "vote_average": movie["vote_average"],
            "genre_ids": genre_ids,
            "release_date": release_date,
            "tmdb_id": movie["id"],
        }

        # Add object to batch queue
        batch.add_object(
            properties=movie_obj,
            uuid=generate_uuid5(movie["id"])
        )
        # Batcher automatically sends batches

# Check for failed objects
if len(movies.batch.failed_objects) > 0:
    print(f"Failed to import {len(movies.batch.failed_objects)} objects")

# client.close()

0it [00:00, ?it/s]

680it [00:00, 1649.72it/s]
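
The import passes `generate_uuid5(movie["id"])` as the object UUID, so each UUID is derived from the TMDB id rather than generated at random. The useful property is determinism: the same id always maps to the same UUID, which should let a rerun of the import overwrite objects instead of duplicating them. The sketch below illustrates that property with the standard library only; `stable_uuid` is a hypothetical stand-in and is not guaranteed to produce the same values as Weaviate's helper.

```python
import uuid


def stable_uuid(identifier) -> str:
    """Derive a version-5 UUID from an identifier (same input, same UUID)."""
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, str(identifier)))


first = stable_uuid(162)   # TMDB id of the first movie in the preview
second = stable_uuid(162)  # same id again
other = stable_uuid(769)   # a different movie

print(first == second)  # identical ids give identical UUIDs
print(first == other)   # different ids give different UUIDs
```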


### Semantic Search
This example finds entries in "Movie" based on their similarity to the query "poverty", and prints out the title, release year, and distance to the query for the top 5 matches.

In [29]:
# Get the collection
movies = client.collections.get("Movie")

# Perform query
response = movies.query.near_text(
    query="poverty", limit=5, return_metadata=wq.MetadataQuery(distance=True)
)

# Inspect the response
for o in response.objects:
    print(
        o.properties["title"], o.properties["release_date"].year
    )  # Print the title and release year (note the release date is a datetime object)
    print(
        f"Distance to query: {o.metadata.distance:.3f}\n"
    )  # Print the distance of the object from the query

# client.close()

In Time 2011
Distance to query: 0.219

Parasite 2019
Distance to query: 0.227

The Pursuit of Happyness 2006
Distance to query: 0.227

City of God 2002
Distance to query: 0.229

Me Before You 2016
Distance to query: 0.234
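
The distance values above measure how far each movie's vector is from the query vector; smaller means more similar. Assuming the collection uses cosine distance (Weaviate's default metric, which this notebook does not override), the number can be understood as one minus the cosine similarity. A minimal sketch with toy two-dimensional vectors, not real embeddings:

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 0.0]
same_direction = [2.0, 0.0]  # points the same way, different length
orthogonal = [0.0, 3.0]      # unrelated direction

print(cosine_distance(query, same_direction))
print(cosine_distance(query, orthogonal))
```

Vectors pointing in the same direction have a distance near 0 regardless of length, while unrelated (orthogonal) vectors have a distance of 1.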



In [30]:
df[df["title"]=="The Pursuit of Happyness"]

Unnamed: 0,backdrop_path,genre_ids,id,original_language,original_title,overview,popularity,poster_path,release_date,title,video,vote_average,vote_count
329,/nKOQiWjhv6LXXSR3PiIab3LrKtU.jpg,[18],1402,en,The Pursuit of Happyness,A struggling salesman takes custody of his son...,44.821,/f6l9rghSHORkWLurUGJhaKAiyjY.jpg,2006-12-14,The Pursuit of Happyness,False,7.9,9370


In [31]:
# A struggling salesman takes custody of his son as he's poised to begin a life-changing professional career

### Keyword search

This example finds entries in "Movie" with the highest keyword search scores for the term "history", and prints out the title and release year of the top 5 matches.

In [32]:

# Get the collection
movies = client.collections.get("Movie")

# Perform query
response = movies.query.bm25(
    query="history", limit=5, return_metadata=wq.MetadataQuery(score=True)
)

# Inspect the response
for o in response.objects:
    print(
        o.properties["title"], o.properties["release_date"].year
    )  # Print the title and release year (note the release date is a datetime object)
    print(
        f"BM25 score: {o.metadata.score:.3f}\n"
    )  # Print the BM25 score of the object from the query

# client.close()

American History X 1998
BM25 score: 2.707

A Beautiful Mind 2001
BM25 score: 1.896

Legends of the Fall 1994
BM25 score: 1.663

Hacksaw Ridge 2016
BM25 score: 1.554

Night at the Museum 2006
BM25 score: 1.529



The above results are based on a keyword search score using what's called the [BM25f](https://en.wikipedia.org/wiki/Okapi_BM25) algorithm.
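
To give a feel for how such a score arises, here is a minimal sketch of plain BM25 over a single text field. It is an illustration under stated assumptions, not Weaviate's implementation: BM25F additionally weights multiple fields, the tokenizer here is a simple whitespace split, the `k1` and `b` values are commonly used defaults, and the three documents are toy strings.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document against the query terms with plain BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    scores = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            # Number of documents that contain the term
            doc_freq = sum(1 for other in tokenized if term in other)
            if doc_freq == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
            freq = term_counts[term]
            # Term frequency saturates and is normalized by document length
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * freq * (k1 + 1) / (freq + k1 * length_norm)
        scores.append(score)
    return scores


docs = ["american history x", "a beautiful mind", "the history of history"]
scores = bm25_scores(["history"], docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

A document that never mentions the term scores zero, and repeated mentions raise the score with diminishing returns.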

### Hybrid search

This example finds entries in "Movie" with the highest hybrid search scores for the term "history", and prints out the title and release year of the top 5 matches.

In [33]:
# Get the collection
movies = client.collections.get("Movie")

# Perform query
response = movies.query.hybrid(
    query="history", limit=5, return_metadata=wq.MetadataQuery(score=True)
)

# Inspect the response
for o in response.objects:
    print(
        o.properties["title"], o.properties["release_date"].year
    )  # Print the title and release year (note the release date is a datetime object)
    print(
        f"Hybrid score: {o.metadata.score:.3f}\n"
    )  # Print the hybrid search score of the object from the query

Legends of the Fall 1994
Hybrid score: 0.822

Hacksaw Ridge 2016
Hybrid score: 0.658

Oppenheimer 2023
Hybrid score: 0.608

A Beautiful Mind 2001
Hybrid score: 0.528

The Notebook 2004
Hybrid score: 0.510



The results are based on a hybrid search score. A hybrid search blends results of BM25 and semantic/vector searches.
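
One way to picture the blending is to normalize each result set's scores to the range 0 to 1 and take a weighted sum, where a weight `alpha` favors the vector results and `1 - alpha` favors the keyword results. The sketch below does exactly that with made-up scores for four hypothetical objects; it is an illustration of the idea, and Weaviate's own fusion algorithm may differ in its details.

```python
def min_max(scores):
    """Rescale a dict of scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def fuse(keyword_scores, vector_scores, alpha=0.5):
    """Blend normalized keyword and vector scores; higher alpha favors vectors."""
    keyword = min_max(keyword_scores)
    vector = min_max(vector_scores)
    fused = {
        key: alpha * vector.get(key, 0.0) + (1 - alpha) * keyword.get(key, 0.0)
        for key in set(keyword) | set(vector)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Made-up scores: keyword scores are BM25-like, vector scores are similarities
keyword = {"A": 2.7, "B": 1.9, "C": 1.5}
vector = {"B": 0.9, "C": 0.8, "D": 0.7}

ranking = fuse(keyword, vector)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Object B ranks first here because it does reasonably well in both searches, even though it tops only one of them. This matches the behavior seen above, where the hybrid ranking differs from the pure keyword ranking.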

### Filters
Filters can be used to precisely refine search results. You can filter by properties as well as metadata, and you can combine multiple filters with AND or OR conditions to further narrow down the results.

In [34]:
# Get the collection
movies = client.collections.get("Movie")

# Perform query
response = movies.query.near_text(
    query="poverty",
    limit=5,
    return_metadata=wq.MetadataQuery(distance=True),
    # Use a timezone-aware datetime so the cutoff is unambiguous
    filters=wq.Filter.by_property("release_date").greater_than(
        datetime(2010, 1, 1, tzinfo=timezone.utc)
    )
)

# Inspect the response
for o in response.objects:
    print(
        o.properties["title"], o.properties["release_date"].year
    )  # Print the title and release year (note the release date is a datetime object)
    print(
        f"Distance to query: {o.metadata.distance:.3f}\n"
    )  # Print the distance of the object from the query



In Time 2011
Distance to query: 0.219

Parasite 2019
Distance to query: 0.227

Me Before You 2016
Distance to query: 0.234

The Help 2011
Distance to query: 0.238

The Intouchables 2011
Distance to query: 0.241
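
The effect of the filter can be reproduced by hand from the two result lists shown in this notebook: objects that fail the condition are dropped, and the remaining ones are ranked by distance. The sketch below uses the titles, years, and distances printed above. It approximates the date condition with a year cutoff, and it only mimics the outcome; in Weaviate the filtering happens on the server.

```python
# (title, release year, distance) copied from the outputs in this notebook
candidates = [
    ("In Time", 2011, 0.219),
    ("Parasite", 2019, 0.227),
    ("The Pursuit of Happyness", 2006, 0.227),
    ("City of God", 2002, 0.229),
    ("Me Before You", 2016, 0.234),
    ("The Help", 2011, 0.238),
    ("The Intouchables", 2011, 0.241),
]


def filtered_top_k(items, min_year, k):
    """Keep items released in or after `min_year`, then take the k closest."""
    kept = [item for item in items if item[1] >= min_year]
    return sorted(kept, key=lambda item: item[2])[:k]


top = filtered_top_k(candidates, min_year=2010, k=5)
for title, year, distance in top:
    print(title, year, distance)
```

The two pre-2010 movies drop out, and the next closest matches move up to fill the five slots.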



### RAG: Overview

Retrieval augmented generation (RAG) is a way to combine the best of both worlds: the retrieval capabilities of semantic search and the generation capabilities of AI models such as large language models. This allows you to retrieve objects from a Weaviate instance and then generate outputs based on the retrieved objects.

When we created a collection, we specified the `generative_config` parameter as shown here:

```python
generative_config=wc.Configure.Generative.openai()
```

This selects a generative module that will be used to generate outputs based on the retrieved objects. In this case, we're using the openai module, and the GPT family of large language models.

As we did before with the vectorizer module, you will require an API key from the provider of the generative module. In this case, you will need an API key from OpenAI.

### RAG queries

RAG queries are also called 'generative' queries in Weaviate. You can access these functions through the generate submodule of the collection object.

Each generative query runs on top of a regular search query. With a single prompt, the model is called once for each retrieved object; with a grouped task, it is called once for the whole set of retrieved objects.
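
Conceptually, the two kinds of generative query differ in how many prompts are built from the search results. The sketch below illustrates this with plain string handling; `single_prompts` and `grouped_prompt` are hypothetical helpers, and the exact prompt format Weaviate sends to the model is an assumption made for illustration.

```python
import json


def single_prompts(template, objects):
    """Build one prompt per object by filling property placeholders."""
    return [template.format(**obj) for obj in objects]


def grouped_prompt(task, objects):
    """Build one prompt that covers all objects together."""
    context = "\n".join(json.dumps(obj) for obj in objects)
    return f"{task}\n\n{context}"


retrieved = [{"title": "In Time"}, {"title": "Gattaca"}]

per_object = single_prompts("Translate this into Tamil: {title}", retrieved)
combined = grouped_prompt("What do these movies have in common?", retrieved)

print(per_object)
print(combined)
```

The first form yields one generated text per object, while the second yields a single generated text for the whole response.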



### Single Prompt

In [35]:
# Get the collection
movies = client.collections.get("Movie")

# Perform query
response = movies.generate.near_text(
    query="dystopian future",
    limit=5,
    single_prompt="Translate this into Tamil: {title}"
)

# Inspect the response
for o in response.objects:
    print(o.properties["title"])  # Print the title
    print(o.generated)  # Print the generated text (the title, in Tamil)

# client.close()

In Time
நேரத்தில் (Nerathil)
Gattaca
கட்டாக்கா (Gattaca)
I, Robot
நான், இயந்திரம்
Mad Max: Fury Road
மேட் மேக்ஸ்: ஃபுரி ரோட்
The Maze Runner
தி மேஸ் ரன்னர்


### Grouped task generation

In [36]:
# Get the collection
movies = client.collections.get("Movie")

# Perform query
response = movies.generate.near_text(
    query="poverty",
    limit=5,
    grouped_task="What do these movies have in common. Can you summarize this in one word? you can give upto 3 choices",
    # grouped_properties=["title", "overview"]  # Optional parameter; for reducing prompt length
)

# Inspect the response
for o in response.objects:
    print(o.properties["title"])  # Print the title
print(response.generated)

In Time
Parasite
The Pursuit of Happyness
City of God
Me Before You
Struggle, Class Divide, Redemption


### Exercise: Let's put what we have learned so far into action

In [37]:
# Let's source new data
data_csv = "/workspaces/odsc_west_intro_vector_db/labelled_newscatcher_dataset.csv"
df = pd.read_csv(data_csv,sep=";")
df["id"] = df.index
df.head()

Unnamed: 0,topic,link,domain,published_date,title,lang,id
0,SCIENCE,https://www.eurekalert.org/pub_releases/2020-0...,eurekalert.org,2020-08-06 13:59:45,A closer look at water-splitting's solar fuel ...,en,0
1,SCIENCE,https://www.pulse.ng/news/world/an-irresistibl...,pulse.ng,2020-08-12 15:14:19,"An irresistible scent makes locusts swarm, stu...",en,1
2,SCIENCE,https://www.express.co.uk/news/science/1322607...,express.co.uk,2020-08-13 21:01:00,Artificial intelligence warning: AI will know ...,en,2
3,SCIENCE,https://www.ndtv.com/world-news/glaciers-could...,ndtv.com,2020-08-03 22:18:26,Glaciers Could Have Sculpted Mars Valleys: Study,en,3
4,SCIENCE,https://www.thesun.ie/tech/5742187/perseid-met...,thesun.ie,2020-08-12 19:54:36,Perseid meteor shower 2020: What time and how ...,en,4


In [38]:
df.shape

(108774, 7)

In [39]:
client = weaviate.connect_to_weaviate_cloud(
    cluster_url=CLUSTER_URL,  # Replace with your WCD URL
    auth_credentials=Auth.api_key(
        WEAVIATE_API # Replace with your WCD key
    ),
    headers = {"X-OpenAI-Api-Key": OPENAI_API} 
    )

In [60]:
# Create the collection
colname = "News"
# If the collection already exists, delete it first
if client.collections.exists(colname):
    print("Deleting existing collection...News")
    client.collections.delete(colname)

# Create the collection (this runs whether or not an old one was deleted)
print("Creating new collection...News")
client.collections.create(
    name=colname,
    properties=[
        wc.Property(name="topic", data_type=wc.DataType.TEXT),
        wc.Property(name="link", data_type=wc.DataType.TEXT),
        wc.Property(name="domain", data_type=wc.DataType.TEXT),
        wc.Property(name="published_date", data_type=wc.DataType.DATE),
        wc.Property(name="title", data_type=wc.DataType.TEXT),
        wc.Property(name="lang", data_type=wc.DataType.TEXT),
    ],
    # Define the vectorizer module
    vectorizer_config=wc.Configure.Vectorizer.text2vec_openai(),
    # Define the generative module
    generative_config=wc.Configure.Generative.openai()
)

# client.close()

Deleting existing collection...News
Creating new collection...News


In [62]:
df["published_date"] = pd.to_datetime(df["published_date"])
# Keep the first 10,000 rows; reset_index returns a new frame, so assign the result
df = df[:10000].reset_index(drop=True)
df.head()

Unnamed: 0,topic,link,domain,published_date,title,lang,id
0,SCIENCE,https://www.eurekalert.org/pub_releases/2020-0...,eurekalert.org,2020-08-06 13:59:45,A closer look at water-splitting's solar fuel ...,en,0
1,SCIENCE,https://www.pulse.ng/news/world/an-irresistibl...,pulse.ng,2020-08-12 15:14:19,"An irresistible scent makes locusts swarm, stu...",en,1
2,SCIENCE,https://www.express.co.uk/news/science/1322607...,express.co.uk,2020-08-13 21:01:00,Artificial intelligence warning: AI will know ...,en,2
3,SCIENCE,https://www.ndtv.com/world-news/glaciers-could...,ndtv.com,2020-08-03 22:18:26,Glaciers Could Have Sculpted Mars Valleys: Study,en,3
4,SCIENCE,https://www.thesun.ie/tech/5742187/perseid-met...,thesun.ie,2020-08-12 19:54:36,Perseid meteor shower 2020: What time and how ...,en,4


In [63]:
df.shape

(10000, 7)
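
The `published_date` values in this dataset carry no time zone, while Weaviate's `DATE` type stores RFC 3339 timestamps that include an offset. A minimal sketch of making the conversion explicit, using only the standard library and assuming the timestamps are in UTC (the dataset does not say); `to_rfc3339_utc` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def to_rfc3339_utc(text: str) -> str:
    """Parse 'YYYY-MM-DD HH:MM:SS' and return an RFC 3339 string in UTC."""
    naive = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    # Assumption: the source timestamps are UTC
    return naive.replace(tzinfo=timezone.utc).isoformat()


stamp = to_rfc3339_utc("2020-08-06 13:59:45")  # first row of the preview
print(stamp)
```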

In [64]:
# populate the collection
# Get the collection
news_collection = client.collections.get("News")

# Enter context manager
with news_collection.batch.dynamic() as batch:
    # Loop through the data
    for i, news in tqdm(df.iterrows()):
        
        # Build the object payload
        news_obj = {
            "topic": news["topic"],
            "link": news["link"],
            "domain": news["domain"],
            "published_date": news["published_date"],
            "title": news["title"],
            "lang": news["lang"],
        }

        # Add object to batch queue
        batch.add_object(
            properties=news_obj,
            uuid=generate_uuid5(news["id"])
        )
        # Batcher automatically sends batches

# Check for failed objects
if len(news_collection.batch.failed_objects) > 0:
    print(f"Failed to import {len(news_collection.batch.failed_objects)} objects")

# client.close()

0it [00:00, ?it/s]

10000it [00:27, 370.16it/s]


In [73]:
# Semantic search

# Get the collection
news = client.collections.get("News")

# Perform query
response = news.query.near_text(
    query="crime", limit=5, return_metadata=wq.MetadataQuery(distance=True)
)

# Inspect the response
for o in response.objects:
    print(
        o.properties["title"], o.properties["topic"]
    )  # Print the title and topic
    print(
        f"Distance to query: {o.metadata.distance:.3f}\n"
    )  # Print the distance of the object from the query

# client.close()

Borno police parades 45 criminal suspects for various offences WORLD
Distance to query: 0.185

Senior Comanchero member admits drug charges, money laundering ahead of trial WORLD
Distance to query: 0.188

21 of the youngest criminals locked up in 2020 so far on Teesside HEALTH
Distance to query: 0.189

Comanchero gang raids: Jarome Fonua pleads guilty to money laundering, participating in an organised criminal group WORLD
Distance to query: 0.189

India rape: Two men arrested for 13-year-old's rape and murder WORLD
Distance to query: 0.189



In [74]:
# keyword search

# Get the collection
news = client.collections.get("News")

# Perform query
response = news.query.bm25(
    query="crime", limit=5, return_metadata=wq.MetadataQuery(score=True)
)

# Inspect the response
for o in response.objects:
    print(
        o.properties["title"], o.properties["topic"]
    )  # Print the title and topic
    print(
        f"BM25 score: {o.metadata.score:.3f}\n"
    )  # Print the BM25 score of the object from the query

# client.close()

Rhino poaching – not just an environmental crime WORLD
BM25 score: 2.309

What the jury in the trial of Aaron Brady did not hear WORLD
BM25 score: 2.303

Beirut explosion: Investigators given four days to get to bottom of ‘crime’ WORLD
BM25 score: 1.925

'Barbaric' Horse Killings Put French Countryside on Alert HEALTH
BM25 score: 1.921

Ways AI could be used to facilitate crime over the next 15 years TECHNOLOGY
BM25 score: 1.886

