# Vector Databases

## Typical Vector Database Operations

- Database Setup
- Load Documents
- Create sparse vectors for keyword search
- Create dense vectors for semantic search
- Create HNSW index to power ANN algorithm
- Run searches
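
To make the dense-vector and index steps concrete, here is a minimal sketch of exact (brute-force) nearest-neighbor search in plain Python. The three-dimensional vectors and their names are made up for illustration; a real embedding model produces hundreds of dimensions. An HNSW index exists to avoid this full scan: it skips most comparisons and returns approximate nearest neighbors instead.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def exact_search(query, vectors, k=2):
    """Compare the query with every stored vector and return the k closest names."""
    ranked = sorted(
        vectors,
        key=lambda name: cosine_similarity(query, vectors[name]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical toy embeddings (not produced by any real model)
toy_vectors = {
    "liver": [0.9, 0.1, 0.0],
    "dna": [0.8, 0.3, 0.1],
    "sound barrier": [0.0, 0.2, 0.9],
}

print(exact_search([1.0, 0.2, 0.0], toy_vectors, k=2))
```

The cost of this scan grows linearly with the number of stored vectors, which is the motivation for an ANN index.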

## Weaviate

### What is Weaviate

[Weaviate](https://weaviate.io/) is an open-source vector database; its source code is available on [GitHub](https://github.com/weaviate/weaviate).

### Quick Start

Weaviate can run in the cloud or be hosted locally. This guide demonstrates how to host Weaviate locally.

#### Prerequisites

- Docker, Ollama
- Models: `nomic-embed-text`, `llama3.2`

```
ollama pull nomic-embed-text
ollama pull llama3.2
```

<div style="text-align: center;">
  <img src="images/000.weaviate.00.prerequisites.png" alt="Weaviate Prerequisites" style="width: 70%;">
</div>

## Step 1: Set up Weaviate

### 1.1 Architectures

In this exercise, I set up Weaviate with two different architectures. For simplicity, the rest of this guide focuses mainly on the first one.

- **Arch1**: run _Weaviate_ in Docker and _Ollama_ directly on the host machine
- **Arch2**: run both _Weaviate_ and _Ollama_ in Docker containers

`docker-compose.yml`
```yaml
services:
  weaviate:
    image: cr.weaviate.io/semitechnologies/weaviate:1.32.2
    ports:
      - 8080:8080
      - 50051:50051
    volumes:
      - weaviate_data:/var/lib/weaviate
    restart: on-failure:0
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      ENABLE_API_BASED_MODULES: 'true'
      ENABLE_MODULES: 'text2vec-ollama,generative-ollama'
      CLUSTER_HOSTNAME: 'node1'
    command:
      - --host
      - 0.0.0.0
      - --port
      - '8080'
      - --scheme
      - http
    networks:
      - weaviate_net

  ollama:
    image: ollama/ollama
    container_name: ollama
    ports:
      - "11435:11434"
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
      - OLLAMA_MODELS=/opt/models
    volumes:
      - ollama_data:/root/.ollama
      - ./models:/opt/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
    networks:
      - weaviate_net  # same network as weaviate, so http://ollama:11434 resolves (Arch2)

networks:
  weaviate_net:
volumes:
  weaviate_data:
  ollama_data:
```

### 1.2 Create a Weaviate Database

<div style="text-align: center;">
  <img src="images/000.weaviate.01.docker.arch2.png" alt="Weaviate Container" style="width: 95%;">
</div>
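
Assuming the compose file above is saved as `docker-compose.yml`, running `docker compose up -d` in the same directory should start the containers. The sketch below then polls Weaviate's readiness endpoint (`/v1/.well-known/ready`) with the standard library only. The helper names `is_ready` and `wait_until_ready`, and the timeout and retry values, are my own choices for illustration.

```python
import time
import urllib.request

def is_ready(base_url="http://localhost:8080", timeout=2.0):
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/.well-known/ready", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and urllib's URLError/HTTPError
        return False

def wait_until_ready(base_url="http://localhost:8080", attempts=10, delay=1.0):
    """Poll the readiness endpoint up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if is_ready(base_url):
            return True
        time.sleep(delay)
    return False

# Call wait_until_ready() once the containers have been started.
```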

### 1.3 Install a client library

```
pip install -U weaviate-client
```

### 1.4 Connect to Weaviate

In [1]:
import weaviate

In [2]:
client = weaviate.connect_to_local()
print(client.is_ready())

True


In [3]:
client.close()
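
Because every cell has to remember to call `client.close()`, it can help to tie the cleanup to a `with` block. The sketch below uses `contextlib.closing` from the standard library, which calls `close()` on exit even if the body raises. `FakeClient` is a hypothetical stand-in so that the snippet runs without a server; in the notebook you would pass `weaviate.connect_to_local()` instead.

```python
from contextlib import closing

class FakeClient:
    """Hypothetical stand-in for a Weaviate client; it only mimics is_ready() and close()."""

    def __init__(self):
        self.closed = False

    def is_ready(self):
        return not self.closed

    def close(self):
        self.closed = True

fake = FakeClient()
with closing(fake) as client:  # in practice: closing(weaviate.connect_to_local())
    print(client.is_ready())

print(fake.closed)  # the client was closed when the block exited
```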

## Step 2: Populate the database

### 2.1 Define a collection

> **What is a Collection?**
> 
> A collection is a set of objects that share the same data structure, like a table in relational databases or a collection in NoSQL databases. A collection also includes additional configurations that define how the data objects are stored and indexed.

The following example creates a _collection_ called `Question` with:
- Ollama [embedding model integration](https://docs.weaviate.io/weaviate/model-providers/ollama/embeddings) to create vectors during ingestion & queries, using the `nomic-embed-text` model, and
- Ollama [generative AI integrations](https://docs.weaviate.io/weaviate/model-providers/ollama/generative) for retrieval augmented generation (RAG), using the `llama3.2` model.

In [4]:
import weaviate
from weaviate.classes.config import Configure

In [5]:
client = weaviate.connect_to_local()

In [7]:
COLLECTION_NAME = "Question"

In [8]:
if client.collections.exists(COLLECTION_NAME):
    client.collections.delete(COLLECTION_NAME)

In [9]:
if not client.collections.exists(COLLECTION_NAME):
    questions = client.collections.create(
        name=COLLECTION_NAME,
        vector_config=Configure.Vectors.text2vec_ollama(  # Configure the Ollama embedding integration
            api_endpoint="http://host.docker.internal:11434",  # If running Ollama in Docker, you might need: http://ollama:11434
            model="nomic-embed-text",  # The model to use
        ),
        generative_config=Configure.Generative.ollama(  # Configure the Ollama generative integration
            api_endpoint="http://host.docker.internal:11434",  # If running Ollama in Docker, you might need: http://ollama:11434
            model="llama3.2",  # The model to use
        ),
    )

In [10]:
client.close()
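
The `api_endpoint` value depends on which architecture is in use, as the inline comments above hint. A small sketch that makes the choice explicit; the helper `ollama_endpoint` and the keys `arch1`/`arch2` are hypothetical names, while the two URLs come from the comments in the cell above.

```python
OLLAMA_ENDPOINTS = {
    # Arch1: Ollama runs on the host, Weaviate reaches it from inside Docker
    "arch1": "http://host.docker.internal:11434",
    # Arch2: both run in Docker; "ollama" is the compose service name
    "arch2": "http://ollama:11434",
}

def ollama_endpoint(arch="arch1"):
    """Return the Ollama URL that Weaviate should call for the given architecture."""
    try:
        return OLLAMA_ENDPOINTS[arch]
    except KeyError:
        raise ValueError(f"Unknown architecture: {arch!r}") from None

print(ollama_endpoint("arch1"))
print(ollama_endpoint("arch2"))
```

Note that in Arch2 the container port `11434` applies, not the host-mapped port `11435` from the compose file, because the request travels over the Docker network.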

### 2.2 Add Objects

We can now add data to our collection.

The following example:

- Loads objects, and
- Adds objects to the target collection (`Question`) using a batch process.

> BATCH IMPORTS
>
> [Batch imports](https://docs.weaviate.io/weaviate/manage-objects/import) are the most efficient way to add large amounts of data because they send multiple objects in a single request. See the [How-to: Batch import](https://docs.weaviate.io/weaviate/manage-objects/import) guide for more information.


In [11]:
import weaviate
import json

client = weaviate.connect_to_local()

In [12]:
file_name = "./jeopardy_tiny.json"
try:
    with open(file_name, 'r') as file:
        data = json.load(file)
    print(data)
except FileNotFoundError:
    print(f"Error: File '{file_name}' not found.")
except json.JSONDecodeError as e:
    print(f"Invalid JSON format: {e}")

[{'Category': 'SCIENCE', 'Question': 'This organ removes excess glucose from the blood & stores it as glycogen', 'Answer': 'Liver'}, {'Category': 'ANIMALS', 'Question': "It's the only living mammal in the order Proboseidea", 'Answer': 'Elephant'}, {'Category': 'ANIMALS', 'Question': 'The gavial looks very much like a crocodile except for this bodily feature', 'Answer': 'the nose or snout'}, {'Category': 'ANIMALS', 'Question': 'Weighing around a ton, the eland is the largest species of this animal in Africa', 'Answer': 'Antelope'}, {'Category': 'ANIMALS', 'Question': 'Heaviest of all poisonous snakes is this North American rattlesnake', 'Answer': 'the diamondback rattler'}, {'Category': 'SCIENCE', 'Question': "2000 news: the Gunnison sage grouse isn't just another northern sage grouse, but a new one of this classification", 'Answer': 'species'}, {'Category': 'SCIENCE', 'Question': 'A metal that is ductile can be pulled into this while cold & under pressure', 'Answer': 'wire'}, {'Categor

In [20]:
questions = client.collections.get(COLLECTION_NAME)

In [21]:
from weaviate.util import generate_uuid5
from tqdm import tqdm

In [22]:
# Fixed-size batching: batch_size=1 sends one object per request, which is fine for this tiny dataset
with questions.batch.fixed_size(batch_size=1, concurrent_requests=1) as batch:
    # Iterate over every object loaded from the JSON file
    for document in tqdm(data):
        # Derive a deterministic UUID from the object's contents
        uuid = generate_uuid5(document)

        # Add the object to the batch with properties and UUID
        batch.add_object(
            properties=document,
            uuid=uuid,
        )

        if batch.number_errors > 10:
            print("Batch import stopped due to excessive errors.")
            break

failed_objects = questions.batch.failed_objects
if failed_objects:
    print(f"Number of failed imports: {len(failed_objects)}")
    print(f"First failed object: {failed_objects[0]}")

100%|██████████| 10/10 [00:00<00:00, 105.18it/s]


In [23]:
client.close()
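
`generate_uuid5` derives the object ID from the object's contents, so importing the same object twice should map to the same ID instead of creating a second copy. The standard-library sketch below illustrates the idea with `uuid.uuid5`. The JSON serialization and the choice of `NAMESPACE_DNS` are assumptions for illustration and will not necessarily reproduce the IDs that Weaviate's helper generates.

```python
import json
import uuid

def content_uuid(obj):
    """Build a deterministic UUID from a dict by hashing its canonical JSON form."""
    canonical = json.dumps(obj, sort_keys=True)  # key order no longer matters
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, canonical))

a = {"Category": "SCIENCE", "Answer": "Liver"}
b = {"Answer": "Liver", "Category": "SCIENCE"}  # same content, different key order
c = {"Category": "SCIENCE", "Answer": "DNA"}    # different content

print(content_uuid(a) == content_uuid(b))
print(content_uuid(a) == content_uuid(c))
```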

## Step 3: Queries

### 3.1 Semantic Search

In [30]:
import weaviate
import json

client = weaviate.connect_to_local()
questions = client.collections.get(COLLECTION_NAME)

In [31]:
response = questions.query.near_text(
    query="biology",
    limit=2,
)

for obj in response.objects:
    print(json.dumps(obj.properties, indent=2))

client.close()

{
  "question": "In 1953 Watson & Crick built a model of the molecular structure of this, the gene-carrying substance",
  "category": "SCIENCE",
  "answer": "DNA"
}
{
  "question": "This organ removes excess glucose from the blood & stores it as glycogen",
  "category": "SCIENCE",
  "answer": "Liver"
}


### 3.2 Retrieval Augmented Generation

In [71]:
import weaviate

client = weaviate.connect_to_local()

questions = client.collections.get(COLLECTION_NAME)

response = questions.generate.near_text(
    query="biology",
    limit=2,
    grouped_task="Write a tweet with emojis about these facts."
)

print(response.generative.text) # Inspect the generated text

"Did you know? 🤔💡 DNA was first modeled by Watson & Crick in 1953! 💻 And, did you know your liver is a sugar superstar? 🍽️ It removes excess glucose from the blood and stores it as glycogen! 💪 #ScienceFacts #DNA #Liver #Glycogen"

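Conceptually, a grouped task retrieves the top matches and hands them to the generative model together with the instruction in a single prompt. The sketch below only illustrates that idea: `build_grouped_prompt` is a hypothetical helper, and the prompt layout is an assumption, not the template Weaviate actually uses.

```python
import json

def build_grouped_prompt(task, objects):
    """Combine one instruction with all retrieved objects into a single prompt string."""
    context = "\n".join(json.dumps(o, sort_keys=True) for o in objects)
    return f"{task}\n\nContext:\n{context}"

# Trimmed-down versions of the two objects returned by the search above
retrieved = [
    {"category": "SCIENCE", "answer": "DNA"},
    {"category": "SCIENCE", "answer": "Liver"},
]

prompt = build_grouped_prompt("Write a tweet with emojis about these facts.", retrieved)
print(prompt)
```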

## Step 4: My other examples

In [40]:
import joblib
from pprint import pprint
from weaviate.classes.config import Configure, Property, DataType
from weaviate.classes.query import MetadataQuery
import weaviate
from weaviate.exceptions import WeaviateInvalidInputError

In [41]:
data = joblib.load("data.joblib")
pprint(data[0])

{'attractions': 'South Rim, Havasu Falls, Skywalk',
 'best_season_to_visit': 'Spring, Fall',
 'budget': 'Moderate',
 'description': 'A stunning canyon with vast vistas and incredible geology.',
 'last_updated': '2023-10-01T00:00:00Z',
 'place': 'Grand Canyon',
 'state': 'Arizona',
 'user_ratings': 4.8}


In [72]:
if client:
    client.close()
client = weaviate.connect_to_local()
try:
    if client.collections.exists("Article"):
        client.collections.delete("Article")
    client.collections.create(
        name="Article",
        vector_config=Configure.Vectors.text2vec_ollama(  # Configure the Ollama embedding integration
            api_endpoint="http://host.docker.internal:11434",
            model="nomic-embed-text",  # The model to use
        ),
        generative_config=Configure.Generative.ollama(  # Configure the Ollama generative integration
            api_endpoint="http://host.docker.internal:11434",
            model="llama3.2",  # The model to use
        ),
        properties=[  # Define properties
            Property(name="place", vectorize_property_name=True, data_type=DataType.TEXT),
            Property(name="state", vectorize_property_name=True, data_type=DataType.TEXT),
            Property(name="description", vectorize_property_name=True, data_type=DataType.TEXT),
            Property(name="best_season_to_visit", vectorize_property_name=True, data_type=DataType.TEXT),
            Property(name="attractions", vectorize_property_name=True, data_type=DataType.TEXT),
            Property(name="budget", vectorize_property_name=True, data_type=DataType.TEXT),
            Property(name="user_ratings", data_type=DataType.NUMBER),
            Property(name="last_updated", data_type=DataType.DATE),
        ]
    )
except WeaviateInvalidInputError as err:
    print(f"Creating collection failed: {err}")

In [73]:
client.collections.list_all().keys()

dict_keys(['Article', 'Question'])

In [76]:
if client:
    client.close()
client = weaviate.connect_to_local()
articles = client.collections.get("Article")

In [77]:
with articles.batch.fixed_size(batch_size=1, concurrent_requests=1) as batch:
    for document in tqdm(data):
        uuid = generate_uuid5(document)
        batch.add_object(
            properties=document,
            uuid=uuid,
        )

100%|██████████| 20/20 [00:00<00:00, 99.66it/s]


In [78]:
if client:
    client.close()
client = weaviate.connect_to_local()
articles = client.collections.get("Article")
response = articles.query.near_text(
    query="hotel capacity in downtown Vancouver",
    limit=2,
    return_metadata=MetadataQuery(distance=True)
)
for o in response.objects:
    pprint(o.properties)
    print(f"o.metadata.distance = {o.metadata.distance}\n\n")

{'attractions': 'Cellhouse Tour, Alcatraz Lighthouse',
 'best_season_to_visit': 'Spring, Summer',
 'budget': 'Low',
 'description': 'Famed former prison island located in San Francisco Bay.',
 'last_updated': datetime.datetime(2023, 10, 1, 0, 0, tzinfo=datetime.timezone.utc),
 'place': 'Alcatraz Island',
 'state': 'California',
 'user_ratings': 4.4}
o.metadata.distance = 0.5683944225311279


{'attractions': 'Walk of Fame, Hollywood Sign',
 'best_season_to_visit': 'Spring',
 'budget': 'Moderate',
 'description': 'Famous district in Los Angeles known as the entertainment '
                'capital of the world.',
 'last_updated': datetime.datetime(2023, 10, 1, 0, 0, tzinfo=datetime.timezone.utc),
 'place': 'Hollywood',
 'state': 'California',
 'user_ratings': 4.2}
o.metadata.distance = 0.5694319605827332
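
Neither result has anything to do with hotels in Vancouver: a vector search returns the nearest neighbors even when nothing relevant is stored, and the distances of roughly 0.57 reflect that. One way to handle this is to discard results above a distance cutoff. The sketch below filters on the client side; the cutoff of 0.5 is arbitrary, and the `SimpleNamespace` objects are stand-ins that mimic the shape of the results printed above. I believe `near_text` also accepts a `distance` argument for a server-side cutoff, but treat that as something to verify in the client documentation.

```python
from types import SimpleNamespace

def within_distance(objects, max_distance):
    """Keep only results whose reported distance is at most max_distance."""
    return [
        o for o in objects
        if o.metadata.distance is not None and o.metadata.distance <= max_distance
    ]

# Stand-ins for the two results above (distances copied from the output)
results = [
    SimpleNamespace(
        properties={"place": "Alcatraz Island"},
        metadata=SimpleNamespace(distance=0.5683944225311279),
    ),
    SimpleNamespace(
        properties={"place": "Hollywood"},
        metadata=SimpleNamespace(distance=0.5694319605827332),
    ),
]

print(len(within_distance(results, max_distance=0.5)))
```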




In [86]:
if client:
    client.close()
client = weaviate.connect_to_local()
questions = client.collections.get(COLLECTION_NAME)
response = questions.query.bm25(
    query="food",
    limit=3,
)
for o in response.objects:
    pprint(o.properties)
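
`bm25` is a keyword search: it scores objects by the query terms they literally contain rather than by vector similarity. The sketch below implements a textbook BM25 formula on a few toy sentences. The whitespace tokenizer, the sample sentences, and the parameter values `k1=1.2` and `b=0.75` are assumptions for illustration; Weaviate's own implementation and tokenization differ in detail.

```python
import math

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document for a whitespace-tokenized query."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    scores = []
    for tokens in tokenized:
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for t in tokenized if term in t)  # documents containing the term
            if df == 0:
                continue  # a term that appears nowhere contributes nothing
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            tf = tokens.count(term)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return scores

docs = [
    "the liver stores glycogen",
    "the eland is an antelope",
    "a ductile metal can be pulled into wire",
]

print(bm25_scores("liver", docs))  # only the first sentence contains the term
print(bm25_scores("food", docs))   # no sentence contains the term
```

A query term that occurs in no object gives every object a score of zero, which may be why the `food` query above prints nothing, whereas a semantic `near_text` query always returns its nearest neighbors.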

In [85]:
if client:
    client.close()
client = weaviate.connect_to_local()
questions = client.collections.get(COLLECTION_NAME)
response = questions.query.near_text(
    query="violence",
    limit=3,
)
for o in response.objects:
    pprint(o.properties)

{'answer': 'the diamondback rattler',
 'category': 'ANIMALS',
 'question': 'Heaviest of all poisonous snakes is this North American '
             'rattlesnake'}
{'answer': 'Sound barrier',
 'category': 'SCIENCE',
 'question': 'In 70-degree air, a plane traveling at about 1,130 feet per '
             'second breaks it'}
{'answer': 'the nose or snout',
 'category': 'ANIMALS',
 'question': 'The gavial looks very much like a crocodile except for this '
             'bodily feature'}
