### Schemas - Weaviate

These notes follow the [Weaviate quickstart tutorial](https://weaviate.io/developers/weaviate/quickstart).

In [1]:
import weaviate
import json
import os
import requests
from dotenv import load_dotenv

In [2]:
JEOPARDY_DATA_SOURCE = "https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/jeopardy_tiny.json"

In [3]:
load_dotenv()  # Load variables from a local .env file, if one exists
open_ai_api_key = os.getenv("OPENAI_APIKEY")

In [4]:
client = weaviate.Client(
    url = "http://localhost:8080",  # Replace with your endpoint
    additional_headers = {
        "X-OpenAI-Api-Key": open_ai_api_key  # Replace with your inference API key
    }
)

### Load in Data

In [8]:
resp = requests.get(JEOPARDY_DATA_SOURCE)
data = json.loads(resp.text)

In [9]:
data

[{'Category': 'SCIENCE',
  'Question': 'This organ removes excess glucose from the blood & stores it as glycogen',
  'Answer': 'Liver'},
 {'Category': 'ANIMALS',
  'Question': "It's the only living mammal in the order Proboseidea",
  'Answer': 'Elephant'},
 {'Category': 'ANIMALS',
  'Question': 'The gavial looks very much like a crocodile except for this bodily feature',
  'Answer': 'the nose or snout'},
 {'Category': 'ANIMALS',
  'Question': 'Weighing around a ton, the eland is the largest species of this animal in Africa',
  'Answer': 'Antelope'},
 {'Category': 'ANIMALS',
  'Question': 'Heaviest of all poisonous snakes is this North American rattlesnake',
  'Answer': 'the diamondback rattler'},
 {'Category': 'SCIENCE',
  'Question': "2000 news: the Gunnison sage grouse isn't just another northern sage grouse, but a new one of this classification",
  'Answer': 'species'},
 {'Category': 'SCIENCE',
  'Question': 'A metal that is ductile can be pulled into this while cold & under pressur
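Before importing, it can help to confirm that every record has the keys the import step relies on. The following is a minimal sketch in plain Python; `find_invalid_records` is a hypothetical helper, and the second sample record is made up to show a failure case.

```python
# Keys that appear in each record of the output above
REQUIRED_KEYS = {"Category", "Question", "Answer"}

def find_invalid_records(records):
    """Return (index, missing_keys) pairs for records lacking a required key."""
    problems = []
    for i, record in enumerate(records):
        missing = REQUIRED_KEYS - set(record)
        if missing:
            problems.append((i, sorted(missing)))
    return problems

sample = [
    {"Category": "SCIENCE",
     "Question": "This organ removes excess glucose from the blood & stores it as glycogen",
     "Answer": "Liver"},
    {"Category": "ANIMALS", "Question": "Incomplete record"},  # hypothetical bad row
]
print(find_invalid_records(sample))
```

An empty list means every record can be mapped to the `answer`, `question` and `category` properties.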

### Defining a Class for the Data

*Note* - a class is a data collection in Weaviate that is used to store objects. Creating a class is analogous to creating a table in a relational DB.

In [10]:
class_obj = {
    "class": "Question",
    "vectorizer": "text2vec-openai",  # If set to "none" you must always provide vectors yourself. Could be any other "text2vec-*" also.
    "moduleConfig": {
        "text2vec-openai": {},
        "generative-openai": {}  # Ensure the `generative-openai` module is used for generative queries
    }
}

client.schema.create_class(class_obj)

Loading the data into the DB

In [11]:
client.batch.configure(batch_size=100)  # Configure batch
with client.batch as batch:  # Initialize a batch process
    for i, d in enumerate(data):  # Batch import data
        print(f"importing question: {i+1}")
        properties = {
            "answer": d["Answer"],
            "question": d["Question"],
            "category": d["Category"],
        }
        batch.add_data_object(
            data_object=properties,
            class_name="Question"
        )

importing question: 1
importing question: 2
importing question: 3
importing question: 4
importing question: 5
importing question: 6
importing question: 7
importing question: 8
importing question: 9
importing question: 10
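Conceptually, `batch_size=100` means objects are sent in groups of up to 100 rather than one request per object. The sketch below only illustrates that grouping in plain Python; it is not the client's internal implementation, and `chunked` is a hypothetical helper.

```python
from itertools import islice

def chunked(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Ten placeholder records, matching the ten questions imported above
records = [{"Answer": f"a{i}", "Question": f"q{i}", "Category": "SCIENCE"} for i in range(10)]
sizes = [len(batch) for batch in chunked(records, 4)]
print(sizes)  # [4, 4, 2]
```

With only ten records and a batch size of 100, the whole dataset fits in a single batch.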


### Querying the Data

In [5]:
response = (
    client.query
    .get("Question", ["question", "answer", "category"])
    .with_near_text({"concepts": ["meteorology"]})
    .with_limit(2)
    .do()
)

In [6]:
response

{'data': {'Get': {'Question': [{'answer': 'the atmosphere',
     'category': 'SCIENCE',
     'question': 'Changes in the tropospheric layer of this are what gives us weather'},
    {'answer': 'Sound barrier',
     'category': 'SCIENCE',
     'question': 'In 70-degree air, a plane traveling at about 1,130 feet per second breaks it'}]}}}
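The matching objects sit three levels deep in the response, under `data`, `Get` and the class name. A small sketch for pulling them out safely; `extract_objects` is a hypothetical helper, and the sample dictionary copies the response shown above.

```python
def extract_objects(response, class_name):
    """Return the list of objects for class_name from a Get response, or []."""
    data = response.get("data") or {}
    return (data.get("Get") or {}).get(class_name) or []

sample_response = {
    "data": {"Get": {"Question": [
        {"answer": "the atmosphere", "category": "SCIENCE",
         "question": "Changes in the tropospheric layer of this are what gives us weather"},
        {"answer": "Sound barrier", "category": "SCIENCE",
         "question": "In 70-degree air, a plane traveling at about 1,130 feet per second breaks it"},
    ]}}
}

answers = [obj["answer"] for obj in extract_objects(sample_response, "Question")]
print(answers)  # ['the atmosphere', 'Sound barrier']
```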

### Querying the Data with a Filter

In [19]:
response = (
    client.query
    .get("Question", ["question", "answer", "category"])
    .with_near_text({"concepts": ["trunk"]})
    .with_where({
        "path": ["category"],
        "operator": "Equal",
        "valueText": "ANIMALS"
    })
    .with_limit(2)
    .do()
)

print(json.dumps(response, indent=4))

{
    "data": {
        "Get": {
            "Question": [
                {
                    "answer": "the nose or snout",
                    "category": "ANIMALS",
                    "question": "The gavial looks very much like a crocodile except for this bodily feature"
                },
                {
                    "answer": "Elephant",
                    "category": "ANIMALS",
                    "question": "It's the only living mammal in the order Proboseidea"
                }
            ]
        }
    }
}
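Several conditions can be combined by nesting them as `operands` under an `And` (or `Or`) operator. The sketch below only builds the filter dictionary; `equal_text` and `all_of` are hypothetical helpers, and the second condition is an illustrative assumption rather than part of the tutorial.

```python
import json

def equal_text(prop, value):
    """Build an equality condition on a text property."""
    return {"path": [prop], "operator": "Equal", "valueText": value}

def all_of(*conditions):
    """Combine conditions so that all of them must match."""
    return {"operator": "And", "operands": list(conditions)}

where_filter = all_of(
    equal_text("category", "ANIMALS"),
    equal_text("answer", "Elephant"),  # illustrative extra condition
)
print(json.dumps(where_filter, indent=4))
```

The resulting dictionary can be passed to `.with_where(...)` in place of the single condition used above.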


In [20]:
response = (
    client.query
    .get("Question", ["question", "answer", "category"])
    .with_near_text({"concepts": ["elephants"]})
    .with_generate(single_prompt="Explain {answer} as you might to a five-year-old.")
    .with_limit(2)
    .do()
)

print(json.dumps(response, indent=4))

{
    "data": {
        "Get": {
            "Question": [
                {
                    "_additional": {
                        "generate": {
                            "error": null,
                            "singleResult": "An elephant is a really big animal with a long trunk, big ears, and a strong body. They are usually gray in color. Elephants are very smart and friendly. They live in places called forests and grasslands. They eat lots of plants and fruits. They use their long trunk to grab food and drink water. Elephants also use their trunk to say hello to other elephants by touching them gently. They have big ears that help them hear really well. Elephants are very strong and can carry heavy things with their trunk. They are also great swimmers and love to play in the water. Elephants are loved by many people because they are so amazing and special!"
                        }
                    },
                    "answer": "Elephant",
                    "cat

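The generated text is returned per object under `_additional.generate.singleResult`, next to an `error` field. A minimal sketch for collecting it; `generated_texts` is a hypothetical helper, and the sample response is an abbreviated stand-in for the output above.

```python
def generated_texts(response, class_name):
    """Return (answer, generated_text) pairs; the text is None if generation failed."""
    data = response.get("data") or {}
    objects = (data.get("Get") or {}).get(class_name) or []
    results = []
    for obj in objects:
        generate = (obj.get("_additional") or {}).get("generate") or {}
        text = None if generate.get("error") else generate.get("singleResult")
        results.append((obj.get("answer"), text))
    return results

sample_response = {
    "data": {"Get": {"Question": [
        {
            "_additional": {"generate": {
                "error": None,
                # Abbreviated stand-in for the generated text shown above
                "singleResult": "An elephant is a really big animal with a long trunk...",
            }},
            "answer": "Elephant",
        },
    ]}}
}

for answer, text in generated_texts(sample_response, "Question"):
    print(f"{answer}: {text}")
```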
### Tutorial to do next: https://weaviate.io/developers/weaviate/tutorials/schema

### Deleting Classes

In [7]:
response = (
    client.query
    .get("Question", ["question", "answer", "category"])
    .with_near_text({"concepts": ["meteorology"]})
    .with_limit(2)
    .do()
)
print(response)

{'data': {'Get': {'Question': [{'answer': 'the atmosphere', 'category': 'SCIENCE', 'question': 'Changes in the tropospheric layer of this are what gives us weather'}, {'answer': 'Sound barrier', 'category': 'SCIENCE', 'question': 'In 70-degree air, a plane traveling at about 1,130 feet per second breaks it'}]}}}


In [9]:
# Delete collection "Question" - THIS WILL DELETE THE COLLECTION AND ALL ITS DATA
if client.schema.exists("Question"):  # Only delete if the class is present
    client.schema.delete_class("Question")  # Replace with your collection name

Checking the data has been removed

In [10]:
response = (
    client.query
    .get("Question", ["question", "answer", "category"])
    .with_near_text({"concepts": ["meteorology"]})
    .with_limit(2)
    .do()
)
print(response)

UnexpectedStatusCodeException: Query was not successful! Unexpected status code: 422, with response body: {'error': [{'message': 'no graphql provider present, this is most likely because no schema is present. Import a schema first!'}]}.
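One way to avoid this error is to check the schema before querying. The sketch below assumes the schema dictionary has a top-level `classes` list whose entries carry a `class` key, which is the shape `client.schema.get()` returns; `class_exists` is a hypothetical helper and the two sample schemas are made up.

```python
def class_exists(schema, class_name):
    """Return True if class_name appears in a schema dictionary."""
    return any(c.get("class") == class_name for c in schema.get("classes") or [])

schema_before = {"classes": [{"class": "Question"}]}  # before deletion
schema_after = {"classes": []}                        # after deletion

print(class_exists(schema_before, "Question"))  # True
print(class_exists(schema_after, "Question"))   # False
```

In the notebook, the argument would be `client.schema.get()` rather than a hand-written dictionary.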

### More in-depth description of classes

Each Weaviate class:
- Begins with a capital letter
- Is its own distinct vector space. A search in Weaviate is always restricted to a single class
- Can have its own vectoriser. For example, one class could use an Azure OpenAI model while another uses a different vectoriser, such as one from Hugging Face
- Has properties, each with a defined data type
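The points above can be sketched as two class definitions that use different vectorisers. `make_class` is a hypothetical helper, the `Article` class and its properties are made up for illustration, and the capital-letter check simply mirrors the naming convention listed above.

```python
def make_class(name, vectorizer, property_names):
    """Build a class definition with text properties and the given vectoriser."""
    if not name[:1].isupper():
        raise ValueError(f"Class name should begin with a capital letter: {name!r}")
    return {
        "class": name,
        "vectorizer": vectorizer,
        "properties": [{"name": p, "dataType": ["text"]} for p in property_names],
    }

question_class = make_class("Question", "text2vec-openai", ["question", "answer", "category"])
# Hypothetical second class using a different vectoriser module
article_class = make_class("Article", "text2vec-huggingface", ["title", "body"])

print(question_class["vectorizer"], article_class["vectorizer"])
```

Each dictionary would be passed to `client.schema.create_class(...)` separately, provided the corresponding vectoriser module is enabled on the Weaviate instance.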

In [12]:
class_obj = {
    "class": "Question",
    "description": "Information on Jeopardy questions.",
    "properties": [
        {
            "dataType": ["text"],
            "description": "The question",
            "name": "question"
        },
        {
            "dataType": ["text"],
            "description": "The answer",
            "name": "answer"
        },
        {
            "dataType": ["text"],
            "description": "The category",
            "name": "category"
        }
    ],
    "vectorizer": "text2vec-openai",
}

In [13]:
client.schema.create_class(class_obj)

In [14]:
schema = client.schema.get()

In [16]:
print(json.dumps(schema, indent=4))

{
    "classes": [
        {
            "class": "Question",
            "description": "Information on Jeopardy questions.",
            "invertedIndexConfig": {
                "bm25": {
                    "b": 0.75,
                    "k1": 1.2
                },
                "cleanupIntervalSeconds": 60,
                "stopwords": {
                    "additions": null,
                    "preset": "en",
                    "removals": null
                }
            },
            "moduleConfig": {
                "text2vec-openai": {
                    "baseURL": "https://api.openai.com",
                    "model": "ada",
                    "modelVersion": "002",
                    "type": "text",
                    "vectorizeClassName": true
                }
            },
            "multiTenancyConfig": {
                "enabled": false
            },
            "properties": [
                {
                    "dataType": [
                        "

### Class Specification Notes

The class specification can be customised in several ways. For example:
- The `dataType` of a property can be changed, which affects how the data is tokenised (broken down into smaller units)
- Individual properties can be skipped during vectorisation. In the Jeopardy application, for instance, the class name can also be excluded from vectorisation
- The parameters of the search algorithms can be tuned, for example via `invertedIndexConfig` for the BM25 keyword index or `vectorIndexConfig` for the `HNSW` vector index
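As a rough sketch of these options, the class definition below sets a tokenisation method per property, skips one property during vectorisation, excludes the class name from the vector, and spells out BM25 and HNSW parameters. The specific values are illustrative assumptions, not recommendations, and the choice to skip `category` is made up for the example.

```python
import json

custom_class_obj = {
    "class": "Question",
    "vectorizer": "text2vec-openai",
    "moduleConfig": {
        "text2vec-openai": {"vectorizeClassName": False}  # keep the class name out of the vector
    },
    "properties": [
        {"name": "question", "dataType": ["text"], "tokenization": "word"},
        {"name": "answer", "dataType": ["text"], "tokenization": "word"},
        {
            "name": "category",
            "dataType": ["text"],
            "tokenization": "field",  # treat the whole value as one token
            "moduleConfig": {"text2vec-openai": {"skip": True}},  # do not vectorise this property
        },
    ],
    "invertedIndexConfig": {"bm25": {"b": 0.75, "k1": 1.2}},  # BM25 parameters
    "vectorIndexType": "hnsw",
    "vectorIndexConfig": {"efConstruction": 128, "maxConnections": 64},  # illustrative HNSW values
}

print(json.dumps(custom_class_obj, indent=4))
```

Many of these settings are fixed once the class is created, so it is usually easier to decide on them before importing data.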