# Weaviate

### VectorDB Setup

Retrieve env vars and connect to Weaviate instance running in local docker container. 
 - Set up basic authentication to the db instance using an API key
 - Pass in your OpenAI API key to enable API calls to OpenAI models for creating vector embeddings and generating model outputs

In [1]:
import os
import json
import requests
import weaviate

from dotenv import load_dotenv, find_dotenv

In [2]:
_ = load_dotenv(find_dotenv()) # read local .env file

weaviate_url = os.getenv("WEAVIATE_URL") 
weaviate_key = os.getenv("WEAVIATE_API_KEY")
openai_key = os.getenv("OPENAI_API_KEY")

In [3]:
# Connect to local Weaviate instance running in docker
weaviate_client = weaviate.Client(
    url=weaviate_url,  
    auth_client_secret=weaviate.auth.AuthApiKey(api_key=weaviate_key),  
    additional_headers={
        "X-OpenAI-Api-Key": openai_key
    }
)
weaviate_client.is_ready()

            your code to use Python client v4 `weaviate.WeaviateClient` connections and methods.

            For Python Client v4 usage, see: https://weaviate.io/developers/weaviate/client-libraries/python
            For code migration, see: https://weaviate.io/developers/weaviate/client-libraries/python/v3_v4_migration
            


True

### Collect your dataset

Download a tiny dataset as a simple example from the weaviate tutorial.

In [4]:
# Download a tiny Q&A bank for a tutorial
resp = requests.get('https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/jeopardy_tiny.json')
data = json.loads(resp.text)  # Load data

In [5]:
def json_print(data):
    print(json.dumps(data, indent=2))

json_print(data[0])

{
  "Category": "SCIENCE",
  "Question": "This organ removes excess glucose from the blood & stores it as glycogen",
  "Answer": "Liver"
}


In [5]:
# save the dataset locally
with open("../data/jeopardy_tiny.json", "w") as f:
    json.dump(data, f)

### Define a schema

A schema must be defined before data is imported into the vector DB. It is recommended to manually define the schema, though Weaviate can infer the schema.

`Collections` are groups of objects which share a common structure, and different collections are isolated from one another. For example, you might have a movie database with Movie and Actor collections, each with their own properties. Each collection has its own properties, vectorizer modules, index settings, replication/sharding settings.

As an example, a simple text question and answer collection.

In [6]:
# resetting the schema. CAUTION: This will delete your collection 
if weaviate_client.schema.exists("QuizQuestionBank"):
    weaviate_client.schema.delete_class("QuizQuestionBank")

In [7]:
# we will create the class 
class_obj = {
    "class": "QuizQuestionBank",
    "description": "Jeopardy! question and answer bank.",  # description of the class
    # set the module for creating vector embeddings
    "vectorizer": "text2vec-openai",
    # configure module for generating outputs 
    # -> provide generation config here, e.g. temperature, top p, max tokens etc.
    "moduleConfig": {
        "generative-openai": {
            "model": "gpt-3.5-turbo",
        }  
    },
    # configure the object data structure
    "properties": [
        {
            "name": "question",
            "dataType": ["text"],
            "description": "The question",
            "moduleConfig": {
                "text2vec-openai": {  # this must match the vectorizer used
                    "vectorizePropertyName": True,
                    "tokenization": "lowercase",
                    "model": "text-embedding-3-small",
                    "dimensions": 1536,
                    "type": "text",
                }
            }
        },
        {
            "name": "answer",
            "dataType": ["text"],
            "description": "The answer",
            "moduleConfig": {
                "text2vec-openai": {  # this must match the vectorizer used
                    "vectorizePropertyName": False,
                    "tokenization": "whitespace"
                }
            }
        },
    ],
    # Configure the vector index
    "vectorIndexType": "hnsw",
    "vectorIndexConfig": {
        "distance": "cosine",
        "pq": {
            "enabled": True,
            "segments": 192
        },
    },
    # Configure the inverted index
    "indexTimestamps": True,
    "indexNullState": True,
    "indexPropertyLength": True,
}

# add the schema
weaviate_client.schema.create_class(class_obj)


### Populate VectorDB

Load the sample data into the vector db according to the schema defined above, using the vectorizer to generate vector embeddings. 

This calls the OpenAI API to convert the Questions/Answers/Categories into embeddings using one of the OpenAI text2vec models.

In [8]:
with weaviate_client.batch.configure(batch_size=5) as batch:
    for i, d in enumerate(data):  # Batch import data
        
        print(f"Importing question: {i+1}")
        properties = {
            "answer": d["Answer"],
            "question": d["Question"],
            "category": d["Category"],
        }
        
        batch.add_data_object(
            data_object=properties,
            class_name="QuizQuestionBank"
        )

Importing question: 1
Importing question: 2
Importing question: 3
Importing question: 4
Importing question: 5
Importing question: 6
Importing question: 7
Importing question: 8
Importing question: 9
Importing question: 10


In [21]:
count = weaviate_client.query.aggregate("QuizQuestionBank").with_meta_count().do()
json_print(count)

{
  "data": {
    "Aggregate": {
      "QuizQuestionBank": [
        {
          "meta": {
            "count": 10
          }
        }
      ]
    }
  }
}


### Basic search examples

#### Retrieve a database record

Example basic search for items in the database, retrieving the chosen fields and record id.

In [29]:
response = weaviate_client.query.get("QuizQuestionBank", ["question","category"]).with_additional("id").with_limit(2).do()
json_print(response)

{
  "data": {
    "Get": {
      "QuizQuestionBank": [
        {
          "_additional": {
            "id": "291abea7-7831-4045-a415-25a440005621"
          },
          "category": "SCIENCE",
          "question": "A metal that is ductile can be pulled into this while cold & under pressure"
        },
        {
          "_additional": {
            "id": "29fa3349-478b-479f-a575-4a3e2b58462a"
          },
          "category": "ANIMALS",
          "question": "The gavial looks very much like a crocodile except for this bodily feature"
        }
      ]
    }
  }
}


#### Retrieve a vector embedding

Query the vectorDB for a question/answer record and the corresponding vector embedding

In [28]:
# write a query to extract the vector for a question (the first one)
result = (weaviate_client.query
          .get("QuizQuestionBank", ["category", "question", "answer"])
          .with_additional("vector")
          .with_limit(1)
          .do())

json_print(result)

{
  "data": {
    "Get": {
      "QuizQuestionBank": [
        {
          "_additional": {
            "vector": [
              -0.025137525,
              0.008355691,
              0.014935974,
              -0.023460751,
              -0.0118924165,
              0.020064931,
              -0.044836104,
              -0.024362545,
              -0.052416813,
              -0.040270768,
              0.014055315,
              0.04035531,
              0.0045265863,
              0.012441948,
              0.0013509307,
              0.010483363,
              -0.012174228,
              0.038072642,
              -0.003404627,
              -0.030322846,
              -0.037396297,
              0.0075454847,
              0.020642644,
              -0.016866378,
              -0.0042447755,
              0.006876184,
              0.02626477,
              -0.029815586,
              -0.015851859,
              0.013047841,
              0.014795069,
              0.017091827,
  

### Vector similarity search

#### Search with text

Find the objects with the nearest vector to an input text.

In [30]:
response = (
    weaviate_client.query
    .get("QuizQuestionBank",["question","answer","category"])
    .with_near_text({"concepts": "biology"})
    .with_additional('distance')
    .with_limit(2)
    .do()
)
json_print(response)

{
  "data": {
    "Get": {
      "QuizQuestionBank": [
        {
          "_additional": {
            "distance": 0.19039947
          },
          "answer": "DNA",
          "category": "SCIENCE",
          "question": "In 1953 Watson & Crick built a model of the molecular structure of this, the gene-carrying substance"
        },
        {
          "_additional": {
            "distance": 0.19759446
          },
          "answer": "species",
          "category": "SCIENCE",
          "question": "2000 news: the Gunnison sage grouse isn't just another northern sage grouse, but a new one of this classification"
        }
      ]
    }
  }
}


#### Keyword search

Find the objects in the DB that have the highest BM25F (Best match 25) scores by using `Keyword` search or "sparse vector" search. This matches the input vector to one from the database that has matching words. 

In [39]:
response = (
    weaviate_client.query
    .get("QuizQuestionBank", ["question", "answer"])
    .with_bm25(
      query="grouse",
      properties=["question"]
    )
    .with_additional("score")
    .with_limit(3)
    .do()
)
json_print(response)

{
  "data": {
    "Get": {
      "QuizQuestionBank": [
        {
          "_additional": {
            "score": "1.1326916"
          },
          "answer": "species",
          "question": "2000 news: the Gunnison sage grouse isn't just another northern sage grouse, but a new one of this classification"
        }
      ]
    }
  }
}


#### Hybrid search

Combines the results of the keyword and vector similarity search using a ranking method.

In [45]:
response = (
    weaviate_client.query
    .get("QuizQuestionBank", ["question", "answer"])
    .with_hybrid(
        query="food"
    )
    .with_additional(["score","explainScore"])
    .with_limit(2)
    .do()
)
json_print(response)

{
  "data": {
    "Get": {
      "QuizQuestionBank": [
        {
          "_additional": {
            "explainScore": "\nHybrid (Result Set vector,hybridVector) Document fef454e8-9306-41bf-afe8-37f45a476613: original score 0.76148003, normalized score: 0.75",
            "score": "0.75"
          },
          "answer": "DNA",
          "question": "In 1953 Watson & Crick built a model of the molecular structure of this, the gene-carrying substance"
        },
        {
          "_additional": {
            "explainScore": "\nHybrid (Result Set vector,hybridVector) Document 326f69ee-d8ea-4a77-8075-952a83ec6fbe: original score 0.76132846, normalized score: 0.74684024",
            "score": "0.74684024"
          },
          "answer": "species",
          "question": "2000 news: the Gunnison sage grouse isn't just another northern sage grouse, but a new one of this classification"
        }
      ]
    }
  }
}


Hybrid search results can favour the keyword or vector search component. To change the weighting, pass the `alpha` value in the query.
- An `alpha` value of `1` is pure vector search
- An `alpha` value of `0` is pure keyword search 

In [47]:
response = (
    weaviate_client.query
    .get("QuizQuestionBank", ["question", "answer"])
    .with_hybrid(
        query="grouse",
        alpha=0.25
    )
    .with_additional(["score","explainScore"])
    .with_limit(2)
    .do()
)
json_print(response)

{
  "data": {
    "Get": {
      "QuizQuestionBank": [
        {
          "_additional": {
            "explainScore": "\nHybrid (Result Set keyword,bm25) Document 326f69ee-d8ea-4a77-8075-952a83ec6fbe: original score 0.74657804, normalized score: 0.75 - \nHybrid (Result Set vector,hybridVector) Document 326f69ee-d8ea-4a77-8075-952a83ec6fbe: original score 0.8128271, normalized score: 0.25",
            "score": "1"
          },
          "answer": "species",
          "question": "2000 news: the Gunnison sage grouse isn't just another northern sage grouse, but a new one of this classification"
        },
        {
          "_additional": {
            "explainScore": "\nHybrid (Result Set vector,hybridVector) Document 29fa3349-478b-479f-a575-4a3e2b58462a: original score 0.782499, normalized score: 0.17130345",
            "score": "0.17130345"
          },
          "answer": "the nose or snout",
          "question": "The gavial looks very much like a crocodile except for this bodil

### Generative search

AKA RAG - this is a multi-stage process. First, Weaviate queries its collections, and then passes the retrieved results to a large language model (LLM) to generate an output. Here we are using the OpenAI API to query the `GPT-3.5-turbo` model for an output. 

In [48]:
generate_prompt = "Convert this quiz question: {question} and answer: {answer} into a trivia tweet."

response = (
  weaviate_client.query
  .get("QuizQuestionBank")
  .with_generate(single_prompt=generate_prompt)
  .with_near_text({
    "concepts": ["World history"]
  })
  .with_limit(1)
).do()

json_print(response)

{
  "data": {
    "Get": {
      "QuizQuestionBank": [
        {
          "_additional": {
            "generate": {
              "error": null,
              "singleResult": "Did you know that in 1953, Watson & Crick constructed a model of the gene-carrying substance DNA? #trivia #sciencefacts"
            }
          }
        }
      ]
    }
  }
}


In [49]:
generate_prompt = "What do these animals have in common, if anything?"

response = (
  weaviate_client.query
  .get("QuizQuestionBank")
  .with_generate(grouped_task=generate_prompt)
  .with_near_text({
    "concepts": ["Animals"]
  })
  .with_limit(2)
).do()

json_print(response)

{
  "data": {
    "Get": {
      "QuizQuestionBank": [
        {
          "_additional": {
            "generate": {
              "error": null,
              "groupedResult": "These animals have in common that they are both related to the order Proboseidea. The elephant is the only living mammal in this order, while the gavial, a type of crocodile, shares similarities with the order in terms of physical features."
            }
          }
        },
        {
          "_additional": null
        }
      ]
    }
  }
}
