## Demo Flow
The demo flow is:
- **Prerequisites Setup**: Create a Weaviate instance and install required libraries
- **Connect**: Connect to your Weaviate instance 
- **Schema Configuration**: Configure the schema of your data
    - *Note*: Here we can define which OpenAI Embedding Model to use
    - *Note*: Here we can configure which properties to index
- **Import data**: Load a demo dataset and import it into Weaviate
    - *Note*: The import process will automatically index your data, based on the configuration in the schema
    - *Note*: You don't need to explicitly vectorize your data; Weaviate will communicate with OpenAI to do it for you
- **Run Queries**: Ask questions against the imported data
    - *Note*: You don't need to explicitly vectorize your queries; Weaviate will communicate with OpenAI to do it for you
    - *Note*: The `qna-openai` module automatically communicates with the OpenAI completions endpoint

Once you've run through this notebook, you should have a basic understanding of how to set up and use vector databases for question answering.

## OpenAI Module in Weaviate
All Weaviate instances come equipped with the [text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai) and the [qna-openai](https://weaviate.io/developers/weaviate/modules/reader-generator-modules/qna-openai) modules.

The first module is responsible for handling vectorization at import (or any CRUD operations) and when you run a search query. The second module communicates with the OpenAI completions endpoint.

### No need to manually vectorize data
With [text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai) you don't need to manually vectorize your data, as Weaviate will call OpenAI for you whenever necessary.

All you need to do is:
1. provide your OpenAI API Key when you connect to Weaviate through the client
2. define which OpenAI vectorizer to use in your Schema
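
The two steps above can be sketched as plain Python data, without calling Weaviate. The helper names `openai_headers` and `minimal_class` are hypothetical and exist only for illustration; the `X-OpenAI-Api-Key` header and the `text2vec-openai` vectorizer name are the ones used later in this notebook.

```python
import os


def openai_headers(api_key=None):
    """Build the extra headers passed to the Weaviate client (step 1)."""
    # Fall back to the environment variable if no key is passed explicitly
    key = api_key if api_key is not None else os.getenv("OPENAI_API_KEY", "")
    if not key:
        raise ValueError("No OpenAI API key provided")
    return {"X-OpenAI-Api-Key": key}


def minimal_class(name, vectorizer="text2vec-openai"):
    """Build a minimal class definition that names the vectorizer (step 2)."""
    return {"class": name, "vectorizer": vectorizer}


print(minimal_class("Article"))
```

The first dictionary is what goes into `additional_headers` when the client is created, and the second is the smallest form of the schema object defined further down.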

## Prerequisites

Before we start this project, we need to set up the following:

* create a `Weaviate` instance
* install libraries
    * `weaviate-client`
    * `datasets`
    * `apache-beam`
* get your [OpenAI API key](https://beta.openai.com/account/api-keys)

===========================================================
### Create a Weaviate instance

To create a Weaviate instance we have 2 options:

1. (Recommended path) [Weaviate Cloud Service](https://console.weaviate.io/) – to host your Weaviate instance in the cloud. The free sandbox should be more than enough for this cookbook.
2. Install and run Weaviate locally with Docker.

#### Option 1 – WCS Installation Steps

Use [Weaviate Cloud Service](https://console.weaviate.io/) (WCS) to create a free Weaviate cluster.
1. create a free account and/or login to [WCS](https://console.weaviate.io/)
2. create a `Weaviate Cluster` with the following settings:
    * Sandbox: `Sandbox Free`
    * Weaviate Version: Use default (latest)
    * OIDC Authentication: `Disabled`
3. your instance should be ready in a minute or two
4. make a note of the `Cluster Id`. The link will take you to the full path of your cluster (you will need it later to connect to it). It should be something like: `https://your-project-name.weaviate.network` 

#### Option 2 – local Weaviate instance with Docker

Install and run Weaviate locally with Docker.
1. Download the [./docker-compose.yml](./docker-compose.yml) file
2. Then open your terminal, navigate to where your docker-compose.yml file is located, and start docker with: `docker-compose up -d`
3. Once this is ready, your instance should be available at [http://localhost:8080](http://localhost:8080)

Note. To shut down your docker instance you can call: `docker-compose down`
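
Once the container is up, a readiness check can be scripted instead of opening the URL by hand. This is a minimal sketch that assumes the instance exposes the `/v1/.well-known/ready` readiness endpoint and is reachable without authentication; the helper name `is_weaviate_ready` is hypothetical.

```python
import urllib.error
import urllib.request


def is_weaviate_ready(base_url="http://localhost:8080", timeout=2.0):
    """Return True if the instance answers its readiness probe with HTTP 200."""
    url = base_url.rstrip("/") + "/v1/.well-known/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response: not ready
        return False
```

Calling `is_weaviate_ready()` shortly after `docker-compose up -d` may return `False` for a few seconds while the container starts.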

##### Learn more
To learn more about using Weaviate with Docker, see the [installation documentation](https://weaviate.io/developers/weaviate/installation/docker-compose).

===========================================================    
## Install required libraries

Before running this project make sure to have the following libraries:

### Weaviate Python client

The [Weaviate Python client](https://weaviate.io/developers/weaviate/client-libraries/python) allows you to communicate with your Weaviate instance from your Python project.

### datasets & apache-beam

To load sample data, you need the `datasets` library and its dependency `apache-beam`.

In [3]:
# Install the Weaviate client for Python
# The quotes stop the shell from treating `>` as an output redirect
!pip3 install "weaviate-client>3.11.0"

# Install datasets and apache-beam to load the sample datasets
!pip3 install datasets apache-beam


===========================================================
## Prepare your OpenAI API key

The `OpenAI API key` is used for vectorization of your data at import, and for queries.

If you don't have an OpenAI API key, you can get one from [https://beta.openai.com/account/api-keys](https://beta.openai.com/account/api-keys).

Once you get your key, please add it to your environment variables as `OPENAI_API_KEY`.

In [4]:
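# Export OpenAI API Key
# Note: `!export` runs in a subshell, so the variable does not persist in the notebook kernel.
# Run this export in your terminal before starting Jupyter, or use os.environ as shown in the next cell.
!export OPENAI_API_KEY=""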

In [6]:
# Test that your OpenAI API key is correctly set as an environment variable
# Note: if you run this notebook locally, you will need to reload your terminal and the notebook for the env variables to be live.
import os

# Note: alternatively, you can set a temporary env variable like this:
# os.environ['OPENAI_API_KEY'] = 'your-key-here'

# An empty string counts as "not set", so check for a non-empty value
if os.getenv("OPENAI_API_KEY"):
    print("OPENAI_API_KEY is ready")
else:
    print("OPENAI_API_KEY environment variable not found")

OPENAI_API_KEY is ready


## Connect to your Weaviate instance

In this section, we will:

1. test env variable `OPENAI_API_KEY` – **make sure** you completed the step in [#Prepare-your-OpenAI-API-key](#Prepare-your-OpenAI-API-key)
2. connect to your Weaviate instance with your `OpenAI API Key`
3. and test the client connection

### The client 

After this step, the `client` object will be used to perform all Weaviate-related operations.

In [7]:
import weaviate
from datasets import load_dataset
import os

# Connect to your Weaviate instance
client = weaviate.Client(
    # url="https://your-wcs-instance-name.weaviate.network/",
    url="https://vectordb.bsid.io/",
    # auth_client_secret=weaviate.auth.AuthApiKey(api_key="<YOUR-WEAVIATE-API-KEY>"),  # uncomment this line if your Weaviate instance uses API key authentication (not needed for locally deployed instances)
    additional_headers={
        "X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY")
    }
)

# Check if your instance is live and ready
# This should return `True`
client.is_ready()



True

# Schema

In this section, we will:
1. configure the data schema for your data
2. select the OpenAI modules

> This is the second and final step that requires OpenAI-specific configuration.
> After this step, the rest of the instructions will only touch on Weaviate, as the OpenAI tasks will be handled automatically.


## What is a schema

In Weaviate you create __schemas__ to capture each of the entities you will be searching.

A schema is how you tell Weaviate:
* what embedding model should be used to vectorize the data
* what your data is made of (property names and types)
* which properties should be vectorized and indexed

In this cookbook we will use a dataset for `Articles`, which contains:
* `title`
* `content`
* `url`

We want to vectorize `title` and `content`, but not the `url`.

To vectorize and query the data, we will use `text-embedding-ada-002` (set in the schema as model `ada`, version `002`). For Q&A we will use `gpt-3.5-turbo-instruct`.

In [8]:
# Clear up the schema, so that we can recreate it
client.schema.delete_all()
client.schema.get()

# Define the Schema object to use `text-embedding-ada-002` (model `ada`, version `002`) on `title` and `content`, but skip it for `url`
article_schema = {
    "class": "Article",
    "description": "A collection of articles",
    "vectorizer": "text2vec-openai",
    "moduleConfig": {
        "text2vec-openai": {
          "model": "ada",
          "modelVersion": "002",
          "type": "text"
        }, 
        "qna-openai": {
          "model": "gpt-3.5-turbo-instruct",
          "maxTokens": 16,
          "temperature": 0.0,
          "topP": 1,
          "frequencyPenalty": 0.0,
          "presencePenalty": 0.0
        }
    },
    "properties": [{
        "name": "title",
        "description": "Title of the article",
        "dataType": ["string"]
    },
    {
        "name": "content",
        "description": "Contents of the article",
        "dataType": ["text"]
    },
    {
        "name": "url",
        "description": "URL to the article",
        "dataType": ["string"],
        "moduleConfig": { "text2vec-openai": { "skip": True } }
    }]
}

# add the Article schema
client.schema.create_class(article_schema)

# get the schema to make sure it worked
client.schema.get()

{'classes': [{'class': 'Article',
   'description': 'A collection of articles',
   'invertedIndexConfig': {'bm25': {'b': 0.75, 'k1': 1.2},
    'cleanupIntervalSeconds': 60,
    'stopwords': {'additions': None, 'preset': 'en', 'removals': None}},
   'moduleConfig': {'qna-openai': {'frequencyPenalty': 0.0,
     'maxTokens': 16,
     'model': 'gpt-3.5-turbo-instruct',
     'presencePenalty': 0.0,
     'temperature': 0.0,
     'topP': 1},
    'text2vec-openai': {'baseURL': 'https://api.openai.com',
     'model': 'ada',
     'modelVersion': '002',
     'type': 'text',
     'vectorizeClassName': True}},
   'multiTenancyConfig': {'enabled': False},
   'properties': [{'dataType': ['text'],
     'description': 'Title of the article',
     'indexFilterable': True,
     'indexSearchable': True,
     'moduleConfig': {'text2vec-openai': {'skip': False,
       'vectorizePropertyName': False}},
     'name': 'title',
     'tokenization': 'whitespace'},
    {'dataType': ['text'],
     'description': 'C
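
To double-check which properties will be sent to the vectorizer, the class definition can be inspected as ordinary Python data. The sketch below assumes the dictionary layout used in `article_schema` above, where a property is excluded by `"skip": True` under its `moduleConfig`; the helper name `vectorized_properties` is hypothetical.

```python
def vectorized_properties(class_def, module="text2vec-openai"):
    """List property names that are not marked with `skip: True` for the module."""
    names = []
    for prop in class_def.get("properties", []):
        module_cfg = prop.get("moduleConfig", {}).get(module, {})
        if not module_cfg.get("skip", False):
            names.append(prop["name"])
    return names


# A trimmed-down copy of the Article class, for illustration only
example = {
    "class": "Article",
    "properties": [
        {"name": "title", "dataType": ["text"]},
        {"name": "content", "dataType": ["text"]},
        {"name": "url", "dataType": ["text"],
         "moduleConfig": {"text2vec-openai": {"skip": True}}},
    ],
}
print(vectorized_properties(example))  # ['title', 'content']
```

With a live client, the same helper could be applied to each entry of `client.schema.get()["classes"]`.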

## Import data

In this section we will:
1. load the Simple Wikipedia dataset
2. configure Weaviate Batch import (to make the import more efficient)
3. import the data into Weaviate

> Note: <br/>
> As mentioned before, we don't need to manually vectorize the data.<br/>
> The [text2vec-openai](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-openai) module will take care of that.

In [9]:
### STEP 1 - load the dataset

from datasets import load_dataset

# We'll use the datasets library to pull the Simple Wikipedia dataset for embedding
# trust_remote_code=True acknowledges that this dataset is built by a loading script
dataset = list(load_dataset("wikipedia", "20220301.simple", trust_remote_code=True)["train"])

# Limited to 2.5k articles for demo purposes
dataset = dataset[:2_500]

# Limited to 25k articles for larger demo purposes
# dataset = dataset[:25_000]

# For free OpenAI accounts, you can use 50 objects
# dataset = dataset[:50]



In [10]:
### Step 2 - configure Weaviate Batch, with
# - starting batch size of 10
# - dynamically increase/decrease based on performance
# - add timeout retries if something goes wrong

client.batch.configure(
    batch_size=10, 
    dynamic=True,
    timeout_retries=3,
#   callback=None,
)

<weaviate.batch.crud_batch.Batch at 0x11eda1340>

In [11]:
### Step 3 - import data

print("Importing Articles")

with client.batch as batch:
    for counter, article in enumerate(dataset):
        # Print progress every 10 objects
        if counter % 10 == 0:
            print(f"Import {counter} / {len(dataset)} ")

        properties = {
            "title": article["title"],
            "content": article["text"],
            "url": article["url"]
        }

        batch.add_data_object(properties, "Article")

print("Importing Articles complete")

Importing Articles
Import 0 / 2500 
Import 10 / 2500 
Import 20 / 2500 
Import 30 / 2500 
Import 40 / 2500 
...

In [28]:
# Test that all data has loaded – get object count
result = (
    client.query.aggregate("Article")
    .with_fields("meta { count }")
    .do()
)
print("Object count: ", result["data"]["Aggregate"]["Article"], "\n")

Object count:  [{'meta': {'count': 2494}}] 
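
The count above (2494) is lower than the 2,500 articles submitted, which suggests that a few objects were not stored. One way to surface such failures is to pass a function as the `callback` argument of `client.batch.configure` (left commented out earlier). The sketch below assumes each batch result is a dictionary whose `result.errors.error` entry lists any failures; both the helper name `collect_batch_errors` and the sample response are hypothetical.

```python
def collect_batch_errors(results):
    """Return the error payloads found in a batch response (assumed shape)."""
    errors = []
    for item in results or []:
        item_errors = item.get("result", {}).get("errors")
        if item_errors and "error" in item_errors:
            errors.append(item_errors["error"])
    return errors


# Hypothetical response: one success and one failure
sample = [
    {"result": {"status": "SUCCESS"}},
    {"result": {"errors": {"error": [{"message": "rate limit reached"}]}}},
]
print(collect_batch_errors(sample))
```

A callback built on this helper could print or log the collected errors after each batch, so that failed objects can be retried.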



In [29]:
# Test that the import worked by checking one object
test_article = (
    client.query
    .get("Article", ["title", "url", "content"])
    .with_limit(1)
    .do()
)["data"]["Get"]["Article"][0]

print(test_article['title'])
print(test_article['url'])
print(test_article['content'])

Donald Sutherland
https://simple.wikipedia.org/wiki/Donald%20Sutherland
Donald McNichol Sutherland OC (born July 17, 1935) is a Canadian actor. He has appeared in more than 100 movie and television shows.

Sutherland is known for his roles in Fellini's Casanova, Klute, Don't Look Now, Invasion of the Body Snatchers, JFK, Ordinary People, Pride & Prejudice, and The Hunger Games. He is the father of actor Kiefer Sutherland.

Early life
Sutherland was born in Saint John, New Brunswick, Canada. His ancestry includes Scottish, as well as German and English. When Sutherland was a child, he had rheumatic fever, hepatitis and poliomyelitis. He studied at Victoria College and at University of Toronto. He studied acting London Academy of Music and Dramatic Art. Sutherland started off working as a radio DJ at the age of 14.

Career

Sutherland's acting career began in 1962 with a small role in the television series The Avengers. He then starred in some major roles in movies such as Dr. Terror's H

### Question Answering on the Data

We'll now fire some queries at our new index and get back results based on their closeness to our existing vectors.
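
"Closeness" here is the distance between the question's vector and each stored article's vector, reported as `distance` in the results below. As a rough illustration, and assuming the class uses cosine distance (the schema above does not set a metric explicitly), the calculation looks like this. The function is a plain-Python sketch, not Weaviate's implementation.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 0 for the same direction, 1 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)
```

Smaller values mean the article is semantically closer to the question.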

In [26]:
def qna(queryText, collection_name):
    # Fields to return, including the generated answer and the vector distance
    properties = [
        "title", "content", "url",
        "_additional { answer { hasAnswer property result startPosition endPosition } distance }"
    ]

    # Ask the question against the `content` property
    ask = {
        "question": queryText,
        "properties": ["content"]
    }

    result = (
        client.query
        .get(collection_name, properties)
        .with_ask(ask)
        .with_limit(1)
        .do()
    )

    # Check for errors (printed in red, then the terminal color is reset)
    if "errors" in result:
        print("\033[91mYou probably have run out of OpenAI API calls for the current minute; the limit is set at 60 per minute.\033[0m")
        raise Exception(result["errors"][0]['message'])
    print(result["data"]["Get"][collection_name])
    return result["data"]["Get"][collection_name]

In [30]:
query_result = qna("Did Alanis Morissette win a Grammy?", "Article")

for i, article in enumerate(query_result):
    print(f"{i+1}. { article['_additional']['answer']['result']} (Distance: {round(article['_additional']['distance'],3) })")

[{'_additional': {'answer': {'endPosition': 0, 'hasAnswer': True, 'property': '', 'result': ' Yes, Alanis Morissette won four Grammy Awards for her album Jagged', 'startPosition': 0}, 'distance': 0.157219}, 'content': 'Alanis Nadine Morissette (born June 1, 1974) is a Grammy Award-winning Canadian-American singer and songwriter. She was born in Ottawa, Canada. She began singing in Canada as a teenager in 1990. In 1995, she became popular all over the world.\n\nAs a young child in Canada, Morissette began to act on television, including 5 episodes of the long-running series, You Can\'t Do That on Television. Her first album was released only in Canada in 1990.\n\nHer first international album was Jagged Little Pill, released in 1995. It was a rock-influenced album. Jagged has sold more than 33 million units globally. It became the best-selling debut album in music history. Her next album, Supposed Former Infatuation Junkie, was released in 1998. It was a success as well. Morissette took

In [31]:
query_result = qna("What is the capital of China?", "Article")

for i, article in enumerate(query_result):
    if not article['_additional']['answer']['hasAnswer']:
        print('No answer found')
    else:
        print(f"{i+1}. {article['_additional']['answer']['result']} (Distance: {round(article['_additional']['distance'], 3)})")

[{'_additional': {'answer': {'endPosition': 218, 'hasAnswer': True, 'property': 'content', 'result': ' Beijing', 'startPosition': 210}, 'distance': 0.13843733}, 'content': 'Beijing is the capital of the People\'s Republic of China. The city used to be known as Peking.  It is in the northern and eastern parts of the country. It is the world\'s most populous capital city.\n\nThe city of Beijing has played a very important role in the development of China. Many people from different cities and countries come to Beijing to look for better chances to find work. Nearly 15 million people live there. In 2008 Beijing hosted the Summer Olympic Games, and will host the 2022 Winter Olympic Games. It will be the only city to host both.\n\nBeijing is well known for its ancient history.  Since the Jin Dynasty, Beijing has been the capital of several dynasties (especially the later ones), including the Yuan, Ming, and Qing. There are many places of historic interest in Beijing.\n\nName\n\nThe Mandarin
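
The two query cells repeat the same result handling, so it can be factored into a helper. This is a sketch that assumes the result layout shown in the raw output above (`_additional.answer.hasAnswer`, `_additional.answer.result`, `_additional.distance`); the helper name `format_answers` and the trimmed sample are hypothetical.

```python
def format_answers(articles):
    """Turn raw `Get` results into printable lines, flagging missing answers."""
    lines = []
    for i, article in enumerate(articles, start=1):
        extra = article.get("_additional", {})
        answer = extra.get("answer") or {}
        if not answer.get("hasAnswer"):
            lines.append(f"{i}. No answer found")
            continue
        # The model output often starts with a space, so strip it
        text = (answer.get("result") or "").strip()
        distance = extra.get("distance")
        suffix = f" (Distance: {round(distance, 3)})" if distance is not None else ""
        lines.append(f"{i}. {text}{suffix}")
    return lines


# Trimmed-down copy of the Beijing result, for illustration only
sample = [{"_additional": {"answer": {"hasAnswer": True, "result": " Beijing"},
                           "distance": 0.13843733}}]
print(format_answers(sample))  # ['1. Beijing (Distance: 0.138)']
```

With a live client, `format_answers(qna("What is the capital of China?", "Article"))` would replace the loop in the cell above.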

Thanks for following along. You're now equipped to set up your own vector databases and use embeddings to do all kinds of cool things. Enjoy! For more complex use cases, please continue to work through the other cookbook examples in this repo.