# | NLP | LLM | VectorDB | Weaviate |

## Natural Language Processing (NLP) and Large Language Models (LLM) with Vector Database Weaviate

![Learning](https://t3.ftcdn.net/jpg/06/14/01/52/360_F_614015247_EWZHvC6AAOsaIOepakhyJvMqUu5tpLfY.jpg)


# <b>1 <span style='color:#78D118'>|</span> Overview</b>

In this notebook, we use Weaviate as our vector database. We write a news dataset out to Weaviate, let it compute the embedding vectors, and then query for similar documents. Weaviate also provides customization options, such as whether or not to compress vectors with Product Quantization (refer [here](https://weaviate.io/developers/weaviate/concepts/vector-index#hnsw-with-product-quantizationpq)).
 
Weaviate also offers a managed cloud service, Weaviate Cloud Services, which we use below through its [console](https://console.weaviate.cloud/).

<img src="https://mms.businesswire.com/media/20220824005057/en/1550928/5/Logo-_Colorful.jpg" alt="Learning" width="50%">


## Library pre-requisites

- weaviate-client
    - Installed with `pip` in the Setup cells below
- Spark connector jar file
    - **IMPORTANT!!** Since we will be writing a Spark DataFrame out to Weaviate, we need a Spark connector.
    - You need to attach the Weaviate Spark connector jar to the Spark cluster or session you are using. Refer to the [connector repository](https://github.com/weaviate/spark-connector#download-jar-from-github) if you need more information.



### Setup


In [12]:
!pip install weaviate-client==3.19.1

Collecting weaviate-client==3.19.1
  Obtaining dependency information for weaviate-client==3.19.1 from https://files.pythonhosted.org/packages/d6/3e/daa4e3fdd5dd3499ab12c262c21f3a28a9b868f73341703b791fa86e1b88/weaviate_client-3.19.1-py3-none-any.whl.metadata
  Downloading weaviate_client-3.19.1-py3-none-any.whl.metadata (3.3 kB)
Collecting requests<2.29.0,>=2.28.0 (from weaviate-client==3.19.1)
  Downloading requests-2.28.2-py3-none-any.whl (62 kB)
Collecting validators<=0.21.0,>=0.18.2 (from weaviate-client==3.19.1)
  Obtaining dependency information for validators<=0.21.0,>=0.18.2 from https://files.pythonhosted.org/packages/ad/50/18dbf2ac594234ee6249bfe3425fa424c18eeb96f29dcd47f199ed6c51bc/validators-0.21.0-py3-none-any.whl.metadata
  Downloading validators-0.21.0-py3-none-any.whl.metadata (2.6 kB)
Collecting authlib>=1.1.0 (from weaviate-client==3.19.1)
  Ob

In [13]:
!pip install sparkmagic
!pip install pyspark

Collecting sparkmagic
  Downloading sparkmagic-0.21.0.tar.gz (45 kB)
  Preparing metadata (setup.py) ... done
Collecting hdijupyterutils>=0.6 (from sparkmagic)
  Downloading hdijupyterutils-0.21.0.tar.gz (5.1 kB)
  Preparing metadata (setup.py) ... done
Collecting autovizwidget>=0.6 (from sparkmagic)
  Downloading autovizwidget-0.21.0.tar.gz (9.0 kB)
  Preparing metadata (setup.py) ... done
Collecting pandas<2.0.0,>=0.17.1 (from sparkmagic)
  Downloading pandas-1.5.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.1 MB)
Collecting requests_kerberos>=0.8.0 (from sparkmagic)
  Downloading requests_kerberos-0.14.0-py2.py3-none-any.whl (11 kB)
Collecting jupyter>=1 (from hdijupyterutil

In [7]:
!pip install pydantic==1.8.1

Collecting pydantic==1.8.1
  Downloading pydantic-1.8.1-py3-none-any.whl (125 kB)
Installing collected packages: pydantic
  Attempting uninstall: pydantic
    Found existing installation: pydantic 2.5.2
    Uninstalling pydantic-2.5.2:
      Successfully uninstalled pydantic-2.5.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
confection 0.1.4 requires pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4, but you have pydantic 1.8.1 which is incompatible.
fastapi 0.101.1 requires pydantic!=1.8,!=1.8.1,!=2.0.0,!=2.0.1,!=2.1.0,<3.0.0,>=1.7.4, but you have pydantic 1.8.1 which is incompatible.
openai 1.6.1 requires pydantic<3,>=1.9.0, but you have pydantic 1.8.1 which is incompatible.
spacy 3.7.2 requires pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4, but you have

In [4]:
!pip install openai==1.6.1 httpcore==1.0.2 httpx==0.26.0 typing-extensions==4.9.0
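
Because pip reported dependency conflicts above (for example between `pydantic 1.8.1` and `openai 1.6.1`), it can help to confirm which versions actually ended up installed before going further. The following is a minimal sketch using only the standard library; the package list is an assumption based on the install cells in this notebook.

```python
from importlib import metadata

# Packages this notebook relies on; the list is an assumption
# based on the pip cells above.
PACKAGES = ["weaviate-client", "pyspark", "openai", "pydantic"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```

If a version differs from the one pinned above, restarting the kernel after the install cells is usually the first thing to try.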



In [4]:
cache_dir = "./cache"

In [5]:
import pandas as pd
# Use the full option names so that each call matches exactly one pandas option
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_seq_items', None)
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.expand_frame_repr', True)

# <b>2 <span style='color:#78D118'>|</span> Setting up your Weaviate</b>

[Weaviate](https://weaviate.io/) is an open-source persistent and fault-tolerant [vector database](https://weaviate.io/developers/weaviate/concepts/storage). It integrates with a variety of tools, including OpenAI and Hugging Face Transformers. You can refer to their [documentation here](https://weaviate.io/developers/weaviate/quickstart).

Before we proceed, you need your own Weaviate cluster (this notebook refers to its URL as the Weaviate Network). To create one, visit the [homepage](https://weaviate.io/).

Step 1: Click on `Start Free` 

<img src="https://files.training.databricks.com/images/weaviate_homepage.png" width=500>

Step 2: You will be brought to this [Console page](https://console.weaviate.cloud/). If this is your first time using Weaviate, click `Register here` and pass in your credentials.

<img src="https://files.training.databricks.com/images/weaviate_register.png" width=500>

Step 3: Click on `Create cluster` and select `Free sandbox`. Provide your cluster name. For simplicity, we will toggle `enable authentication` to be `No`. Then, hit `Create`. 

<img src="https://files.training.databricks.com/images/weaviate_create_cluster.png" width=900>

Step 4: Click on `Details`, copy the `Cluster URL`, and paste it into `WEAVIATE_NETWORK` in the cell below.


We will use embeddings from OpenAI, so we also need an OpenAI API key.

Steps:
1. You need to [create an account](https://platform.openai.com/signup) on OpenAI. 
2. Generate an OpenAI [API key here](https://platform.openai.com/account/api-keys). 

Note: OpenAI does not have a free option, but it gives you 5€ as credit. Once you have exhausted that credit, you will need to add a payment method, and you will be [charged per token usage](https://openai.com/pricing). **IMPORTANT**: Keep your OpenAI API key to yourself. Anyone who has your key can charge their usage to your account!


In [9]:
import os

os.environ["OPENAI_API_KEY"] = "<FILL IN>"
os.environ["WEAVIATE_NETWORK"] = "<FILL IN>"

In [10]:
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]
weaviate_network = os.environ["WEAVIATE_NETWORK"]
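
Since the key must stay private, one option is to set both variables outside the notebook and have the notebook fail early when they are missing or still contain the placeholder. The sketch below illustrates that idea; `require_env` is a hypothetical helper, not part of any library used here.

```python
import os

PLACEHOLDER = "<FILL IN>"  # the placeholder string used in the cell above


def require_env(name):
    """Return the value of an environment variable, failing early if it is unset."""
    value = os.environ.get(name, "").strip()
    if not value or value == PLACEHOLDER:
        raise RuntimeError(f"Set the {name} environment variable before continuing.")
    return value


# Possible usage, in place of reading os.environ directly:
# weaviate_network = require_env("WEAVIATE_NETWORK")
```

Failing here gives a clearer message than a connection or authentication error several cells later.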

In [14]:
import weaviate

client = weaviate.Client(
    weaviate_network, additional_headers={"X-OpenAI-Api-Key": openai.api_key}
)
client.is_ready()

True
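
A freshly created sandbox cluster may take a moment to come up, so `client.is_ready()` can return `False` on the first attempt. The following sketch polls any zero-argument readiness check until it succeeds or a timeout expires; `wait_until_ready` and its default timings are assumptions chosen for illustration.

```python
import time


def wait_until_ready(is_ready, timeout_s=60.0, interval_s=2.0):
    """Poll a zero-argument callable until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if is_ready():
                return True
        except Exception:
            # Treat errors raised by the check as "not ready yet" and retry.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Possible usage with the client created above:
# wait_until_ready(client.is_ready)
```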

# <b>3 <span style='color:#78D118'>|</span> Spark setup</b>

#### Dataset


In this section, we are going to use the data on <a href="https://newscatcherapi.com/" target="_blank">news topics collected by the NewsCatcher team</a>, which collects and indexes news articles and releases them to the open-source community. The dataset can be downloaded from <a href="https://www.kaggle.com/kotartemiy/topic-labeled-news-dataset" target="_blank">Kaggle</a>.


In [15]:
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

# Run Spark in local mode, using all available cores
spark = SparkSession.builder.master("local[*]").getOrCreate()

# The dataset is semicolon-separated and has a header row
df = (
    spark
    .read
    .option("header", True)
    .option("sep", ";")
    .format("csv")
    .load("/kaggle/input/topic-labeled-news-dataset/labelled_newscatcher_dataset.csv")
)
print("DataFrame Type:")
display(df)
print("\n")
print("DataFrame Contents:")
display(df.show(10))

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/12/29 10:30:54 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


DataFrame Type:


DataFrame[topic: string, link: string, domain: string, published_date: string, title: string, lang: string]



DataFrame Contents:
+-------+--------------------+--------------------+-------------------+--------------------+----+
|  topic|                link|              domain|     published_date|               title|lang|
+-------+--------------------+--------------------+-------------------+--------------------+----+
|SCIENCE|https://www.eurek...|      eurekalert.org|2020-08-06 13:59:45|A closer look at ...|  en|
|SCIENCE|https://www.pulse...|            pulse.ng|2020-08-12 15:14:19|An irresistible s...|  en|
|SCIENCE|https://www.expre...|       express.co.uk|2020-08-13 21:01:00|Artificial intell...|  en|
|SCIENCE|https://www.ndtv....|            ndtv.com|2020-08-03 22:18:26|Glaciers Could Ha...|  en|
|SCIENCE|https://www.thesu...|           thesun.ie|2020-08-12 19:54:36|Perseid meteor sh...|  en|
|SCIENCE|https://interesti...|interestingengine...|2020-08-08 11:05:45|NASA Releases In-...|  en|
|SCIENCE|https://www.thequ...|        thequint.com|2020-05-28 09:09:46|SpaceX, NASA Demo...|  en

None

# <b>4 <span style='color:#78D118'>|</span> Dataset into Weaviate</b>

We are going to store this dataset in the Weaviate database. To do that, we first need to define a schema. A schema is where we define classes, class properties, data types, and vectorizer modules we would like to use. 

In the schema below, notice that:

- We capitalize the first letter of `class_name`, because Weaviate requires class names to start with a capital letter.
- We specify the data type of each column within `properties`.
- We use `text2vec-openai` as the vectorizer, so Weaviate calls OpenAI to embed each object for us.
- You can also choose to upload your own vectors (refer to [docs here](https://weaviate.io/developers/weaviate/api/rest/objects#with-a-custom-vector)) or create a class without any vectors (although similarity search will then be unavailable).

[Reference documentation here](https://weaviate.io/developers/weaviate/tutorials/schema)
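
The points above can be sketched as a small helper that builds a class definition from a list of column names. This is only an illustration: `build_class_obj` is a hypothetical function, and it assumes every column is stored as a string property, as in the schema used in this notebook.

```python
def build_class_obj(class_name, columns, description="", vectorizer="text2vec-openai"):
    """Build a Weaviate class definition with one string property per column."""
    if not class_name or not class_name[0].isupper():
        raise ValueError("Weaviate class names must start with a capital letter")
    return {
        "class": class_name,
        "description": description,
        "properties": [{"name": col, "dataType": ["string"]} for col in columns],
        "vectorizer": vectorizer,
    }


columns = ["topic", "link", "domain", "published_date", "title", "lang"]
class_obj_sketch = build_class_obj(
    "News", columns, description="News topics collected by NewsCatcher"
)
print(class_obj_sketch["properties"][0])
```

With a Spark DataFrame, the column list could come from `df.columns` instead of being typed by hand.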


## Step 1: Schema

In [16]:
class_name = "News"
class_obj = {
    "class": class_name,
    "description": "News topics collected by NewsCatcher",
    "properties": [
        {"name": "topic", "dataType": ["string"]},
        {"name": "link", "dataType": ["string"]},
        {"name": "domain", "dataType": ["string"]},
        {"name": "published_date", "dataType": ["string"]},
        {"name": "title", "dataType": ["string"]},
        {"name": "lang", "dataType": ["string"]},
    ],
    "vectorizer": "text2vec-openai",
}


In [17]:
# If the class already exists, delete it first so we start from a clean schema
if client.schema.exists(class_name):
    print("Deleting existing class...")
    client.schema.delete_class(class_name)

print(f"Creating class: '{class_name}'")
client.schema.create_class(class_obj)

Creating class: 'News'


In [19]:
import json
print(json.dumps(client.schema.get(class_name), indent=4))

{
    "class": "News",
    "description": "News topics collected by NewsCatcher",
    "invertedIndexConfig": {
        "bm25": {
            "b": 0.75,
            "k1": 1.2
        },
        "cleanupIntervalSeconds": 60,
        "stopwords": {
            "additions": null,
            "preset": "en",
            "removals": null
        }
    },
    "moduleConfig": {
        "text2vec-openai": {
            "baseURL": "https://api.openai.com",
            "model": "ada",
            "modelVersion": "002",
            "type": "text",
            "vectorizeClassName": true
        }
    },
    "multiTenancyConfig": {
        "enabled": false
    },
    "properties": [
        {
            "dataType": [
                "text"
            ],
            "indexFilterable": true,
            "indexSearchable": true,
            "moduleConfig": {
                "text2vec-openai": {
                    "skip": false,
                    "vectorizePropertyName": false
                }
   

## Step 2: Save

Now that the class is created, we are going to write our dataframe to the class. 

**IMPORTANT!!** Since we are writing a Spark DataFrame out, we need a Spark connector for Weaviate. [Download the Spark connector jar file](https://github.com/weaviate/spark-connector#download-jar-from-github) and attach it to your Spark environment before running the next cell. On Databricks, [upload it to your cluster](https://github.com/weaviate/spark-connector#using-the-jar-in-databricks); for a local session like the one created in this notebook, the jar has to be on the session's classpath, for example through Spark's `spark.jars` setting. If you do not do this, the next cell *will fail*.
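
For a local Spark session, one way to attach the jar is Spark's `spark.jars` setting, which has to be supplied when the session is first created. The following is a minimal sketch under that assumption; the helper names and the jar path in the usage comment are hypothetical, so substitute the file you actually downloaded.

```python
from pathlib import Path


def connector_conf(jar_path):
    """Return the Spark setting that adds a local jar to the classpath."""
    return {"spark.jars": str(Path(jar_path).expanduser())}


def build_local_session(jar_path):
    """Create a local SparkSession with the connector jar attached."""
    # Imported lazily so that connector_conf can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.master("local[*]")
    for key, value in connector_conf(jar_path).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Possible usage (hypothetical path):
# spark = build_local_session("~/jars/weaviate-spark-connector.jar")
```

Note that `getOrCreate()` returns an already running session if there is one, in which case the jar setting may not take effect; stopping the existing session or restarting the kernel first avoids that.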


In [None]:
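from urllib.parse import urlparse

# Derive the scheme and host from the cluster URL instead of hard-coding them,
# so that an https:// cluster URL is written to over https
parsed_url = urlparse(weaviate_network)

(
    df.limit(100)
    .write.format("io.weaviate.spark.Weaviate")
    .option("scheme", parsed_url.scheme)
    .option("host", parsed_url.netloc)
    .option("header:X-OpenAI-Api-Key", openai.api_key)
    .option("className", class_name)
    .mode("append")
    .save()
)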
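
If the Spark connector is not available in your environment, an alternative (not the approach this notebook takes) is to send the rows through the Python client's batch interface. The sketch below assumes the v3 `weaviate-client` batch API (`client.batch.configure` and `batch.add_data_object`); `batch_import` itself is a hypothetical helper.

```python
def batch_import(client, rows, class_name, batch_size=50):
    """Add each row (a dict of properties) to a class using client-side batching."""
    client.batch.configure(batch_size=batch_size)
    count = 0
    with client.batch as batch:
        for row in rows:
            batch.add_data_object(data_object=row, class_name=class_name)
            count += 1
    return count


# Possible usage with the objects defined earlier in this notebook:
# rows = [row.asDict() for row in df.limit(100).collect()]
# batch_import(client, rows, class_name)
```

Collecting rows to the driver is only reasonable for small samples such as the 100 rows used here; for larger datasets the connector remains the better fit.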

Let's check if the data is indeed populated. You can run either the following command or go to 
`https://{insert_your_cluster_url_here}/v1/objects` 

You should be able to see the data records, rather than null objects.


In [None]:
client.query.get("News", ["topic"]).do()
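
The query returns a nested dictionary rather than a flat list. As a rough sketch, assuming the GraphQL-style layout returned by the v3 client (`data` → `Get` → class name), the matching objects can be pulled out as follows; the sample response is made up for illustration.

```python
def extract_objects(response, class_name):
    """Return the list of objects for a class from a Get query response."""
    if "errors" in response:
        raise RuntimeError(f"Weaviate returned errors: {response['errors']}")
    return response["data"]["Get"][class_name]


# Hypothetical response, shaped like the output of the query above
sample_response = {"data": {"Get": {"News": [{"topic": "SCIENCE"}, {"topic": "SCIENCE"}]}}}

objects = extract_objects(sample_response, "News")
print(len(objects), "objects returned")
```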

Looks like the data is populated. We can proceed further and do a query search. We are going to search for any news titles related to `locusts`. Additionally, we are going to add a filter statement, where the topic of the news has to be `SCIENCE`. Notice that we don't have to carry out the step of converting `locusts` into embeddings ourselves because we have included a vectorizer within the class earlier on.

We will use `with_near_text` to specify the text we would like to query similar titles for. By default, Weaviate uses cosine distance to determine similar objects. Refer to [distance documentation here](https://weaviate.io/developers/weaviate/config-refs/distances#available-distance-metrics).
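
To make the metric concrete, here is a small sketch in plain Python, taking cosine distance as one minus cosine similarity. The toy vectors are made up; real embeddings have many more dimensions.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two vectors: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])   # close to 0
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])        # close to 1
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])         # close to 2
print(same_direction, orthogonal, opposite)
```

The distance depends only on the angle between the vectors, not on their lengths, which is why the first pair counts as identical.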


In [None]:
where_filter = {
    "path": ["topic"],
    "operator": "Equal",
    "valueString": "SCIENCE",
}

# We are going to search for any titles related to locusts.
# "concepts" takes a list of search terms.
near_text = {"concepts": ["locusts"]}
(
    client.query.get(class_name, ["topic", "domain", "title"])
    .with_where(where_filter)
    .with_near_text(near_text)
    .with_limit(2)
    .do()
)
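
Filters can also be combined. The sketch below builds a compound filter with an `And` operator over two conditions, using the same dictionary format as `where_filter` above; the helper names are hypothetical and the second condition (`lang` equal to `en`) is only an example.

```python
def equal_text(path, value):
    """Build a single equality condition on a string property."""
    return {"path": [path], "operator": "Equal", "valueString": value}


def all_of(*filters):
    """Combine several conditions so that all of them must hold."""
    return {"operator": "And", "operands": list(filters)}


science_in_english = all_of(
    equal_text("topic", "SCIENCE"),
    equal_text("lang", "en"),
)
print(science_in_english)
```

The resulting dictionary can be passed to `with_where` in place of `where_filter`.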

Alternatively, if you wish to supply your own embeddings at query time, you can do that too. Since embeddings are vectors, we will use `with_near_vector` instead.

In the code block below, we additionally introduce a `distance` parameter. The lower the distance score is, the closer the vectors are to each other. Read more about the distance thresholds [here](https://weaviate.io/developers/weaviate/config-refs/distances#available-distance-metrics).


In [None]:
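import openai

model = "text-embedding-ada-002"
# With openai>=1.0 (1.6.1 is installed above), embeddings are created through
# openai.embeddings.create; the older openai.Embedding.create no longer exists
openai_object = openai.embeddings.create(input=["locusts"], model=model)

# The response is a typed object, so use attribute access instead of dict keys
openai_embedding = openai_object.data[0].embedding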

(
    client.query.get("News", ["topic", "domain", "title"])
    .with_where(where_filter)
    .with_near_vector(
        {
            "vector": openai_embedding,
            "distance": 0.7,  # this sets a threshold for distance metric
        }
    )
    .with_limit(2)
    .do()
)
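
To get a feel for what a threshold such as `0.7` means, the sketch below converts a cosine distance into Weaviate's `certainty` score, assuming the relation `certainty = 1 - distance / 2`, and filters a list of hits by distance. The hits are made up; in a real query the distance can be requested with `.with_additional(["distance"])`.

```python
def distance_to_certainty(distance):
    """Convert a cosine distance (0 to 2) into a certainty score (1 to 0)."""
    return 1.0 - distance / 2.0


def within_distance(hits, max_distance):
    """Keep only the hits whose reported distance is at most max_distance."""
    return [hit for hit in hits if hit["_additional"]["distance"] <= max_distance]


# Hypothetical hits, shaped like query results that include _additional.distance
hits = [
    {"title": "Locust swarms spread", "_additional": {"distance": 0.15}},
    {"title": "Stock markets rally", "_additional": {"distance": 0.85}},
]

close_hits = within_distance(hits, max_distance=0.7)
print([hit["title"] for hit in close_hits])
print(distance_to_certainty(0.7))
```

A distance threshold of `0.7` is therefore fairly permissive; lowering it returns fewer but closer matches.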