# | NLP | LLM | VectorDB | Pinecone |

## Natural Language Processing (NLP) and Large Language Models (LLM) with Vector Database Pinecone

![Learning](https://t3.ftcdn.net/jpg/06/14/01/52/360_F_614015247_EWZHvC6AAOsaIOepakhyJvMqUu5tpLfY.jpg)


# <b>1 <span style='color:#78D118'>|</span> Overview</b>

In this section, we try out another vector database, Pinecone. It has a free tier; you need to sign up for it before you can run the cells below.

Pinecone is a cloud-based vector database that offers fast and scalable similarity search for high-dimensional data, with a focus on simplicity and ease of use. 

![Learning](https://d7umqicpi7263.cloudfront.net/img/product/738798c3-eeca-494a-a2a9-161bee9450b2/310429fb-2ce8-4186-adea-cc619511ac3c.png)

## Library pre-requisites

- pinecone-client
    - Installed with `pip` in the Setup section below.
- Spark connector jar file
    - **IMPORTANT!!** Because we write a Spark DataFrame out to Pinecone, we need a Spark connector.
    - Attach the Spark-Pinecone connector `s3://pinecone-jars/0.2.1/spark-pinecone-uberjar.jar` to the cluster you are using. Refer to this [documentation](https://docs.pinecone.io/docs/databricks#setting-up-a-spark-cluster) if you need more information.



### Setup


In [1]:
!pip install pinecone-client==2.2.2


In [2]:
!pip install sparkmagic
!pip install pyspark


In [3]:
!pip install -U sentence-transformers



In [4]:
cache_dir = "./cache"

In [5]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_seq_items', None)
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.expand_frame_repr', True)

# <b>2 <span style='color:#78D118'>|</span> Setting up your Pinecone</b>

Step 1: Go to the Pinecone [home page](https://www.pinecone.io/) and click `Sign Up Free` in the top-right corner.
<br>
Step 2: Click `Sign Up`. Depending on Pinecone's availability, you may not be able to create a new account.

<img src="https://files.training.databricks.com/images/pinecone_register.png" width=300>

Step 3: Once you are in the console, navigate to `API Keys` and copy the `Environment` and `Value` (this is your API key).

<img src="https://files.training.databricks.com/images/pinecone_credentials.png" width=500>


In [6]:
import os

os.environ["PINECONE_API_KEY"] = "<FILL IN>"
os.environ["PINECONE_ENV"] = "<FILL IN>"

In [7]:
import pinecone

pinecone_api_key = os.environ["PINECONE_API_KEY"]
pinecone_env = os.environ["PINECONE_ENV"]

pinecone.init(api_key=pinecone_api_key, environment=pinecone_env)
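If a placeholder is left unfilled, the failure only shows up later as an authentication error. A minimal sketch of a guard that fails early is shown below; `require_env` and the `DEMO_PINECONE_ENV` variable are hypothetical names used for illustration and are not part of the Pinecone client.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper: rejects missing values and the "<FILL IN>" placeholder.
    """
    value = os.environ.get(name, "").strip()
    if not value or value == "<FILL IN>":
        raise RuntimeError(f"Set the {name} environment variable before continuing.")
    return value


# Demonstration with a made-up variable and value (not a real credential)
os.environ["DEMO_PINECONE_ENV"] = "example-env"
print(require_env("DEMO_PINECONE_ENV"))
```

In the notebook, the same helper could be called with `"PINECONE_API_KEY"` and `"PINECONE_ENV"` before `pinecone.init`.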




# <b>3 <span style='color:#78D118'>|</span> Spark setup</b>

In [8]:
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

# Run Spark in local mode, using all available cores
spark = SparkSession.builder.master("local[*]").getOrCreate()

df = (
    spark.read
    .option("header", True)
    .option("sep", ";")
    .format("csv")
    .load("/kaggle/input/topic-labeled-news-dataset/labelled_newscatcher_dataset.csv")
    # Add a unique string ID to every row
    .withColumn("id", F.expr("uuid()"))
)
print("DataFrame Type:")
display(df)
print("\n")
print("DataFrame Contents:")
# show() prints the rows itself and returns None, so it needs no display() wrapper
df.show(10)

DataFrame Type:


DataFrame[topic: string, link: string, domain: string, published_date: string, title: string, lang: string, id: string]



DataFrame Contents:
+-------+--------------------+--------------------+-------------------+--------------------+----+--------------------+
|  topic|                link|              domain|     published_date|               title|lang|                  id|
+-------+--------------------+--------------------+-------------------+--------------------+----+--------------------+
|SCIENCE|https://www.eurek...|      eurekalert.org|2020-08-06 13:59:45|A closer look at ...|  en|a7a32814-12e3-442...|
|SCIENCE|https://www.pulse...|            pulse.ng|2020-08-12 15:14:19|An irresistible s...|  en|ba7f94b5-cbad-476...|
|SCIENCE|https://www.expre...|       express.co.uk|2020-08-13 21:01:00|Artificial intell...|  en|a33f545d-3703-4a8...|
|SCIENCE|https://www.ndtv....|            ndtv.com|2020-08-03 22:18:26|Glaciers Could Ha...|  en|4daf3542-1374-4d8...|
|SCIENCE|https://www.thesu...|           thesun.ie|2020-08-12 19:54:36|Perseid meteor sh...|  en|aab4e21c-5af8-48a...|
|SCIENCE|https://interesti

# <b>4 <span style='color:#78D118'>|</span> Generate embedding and save</b>

For Pinecone, we need to generate the embeddings first and save them to a DataFrame before we can write them out to Pinecone for indexing.

There are two ways of doing this:
1. Use a pandas DataFrame, apply the single-node embedding model, and upsert to Pinecone in batches.
2. Use a Spark DataFrame with pandas UDFs to apply the embedding model to batches of data.
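Both approaches process the titles in fixed-size batches rather than one row at a time. The batching idea on its own can be sketched as follows; `iter_batches` is a hypothetical helper written for illustration, not something the notebook or Pinecone provides.

```python
from typing import Iterator, List, Sequence, Tuple


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[Tuple[int, List[str]]]:
    """Yield (start_offset, batch) pairs covering `items` in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        # The final batch may be shorter than batch_size
        yield start, list(items[start:start + batch_size])


titles = [f"title {n}" for n in range(7)]
batches = list(iter_batches(titles, batch_size=3))
print([(start, len(batch)) for start, batch in batches])  # [(0, 3), (3, 3), (6, 1)]
```

The start offset is returned alongside each batch so that sequential IDs can be derived from it, as Method 1 does.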


## Method 1: Upsert to Pinecone in batches

In [9]:
pdf = df.limit(1000).toPandas()
display(pdf.head(10))

| | topic | link | domain | published_date | title | lang | id |
|---|---|---|---|---|---|---|---|
| 0 | SCIENCE | https://www.eurekalert.org/pub_releases/2020-08/dbnl-acl080620.php | eurekalert.org | 2020-08-06 13:59:45 | A closer look at water-splitting's solar fuel potential | en | a7a32814-12e3-4425-a046-15c137be1b40 |
| 1 | SCIENCE | https://www.pulse.ng/news/world/an-irresistible-scent-makes-locusts-swarm-study-finds/jy784jw | pulse.ng | 2020-08-12 15:14:19 | An irresistible scent makes locusts swarm, study finds | en | ba7f94b5-cbad-4761-9550-6ebf66cdac2f |
| 2 | SCIENCE | https://www.express.co.uk/news/science/1322607/artificial-intelligence-warning-machine-learning-algorithm-social-media-data | express.co.uk | 2020-08-13 21:01:00 | Artificial intelligence warning: AI will know us better than we know ourselves | en | a33f545d-3703-4a8a-bc7d-340f157fd860 |
| 3 | SCIENCE | https://www.ndtv.com/world-news/glaciers-could-have-sculpted-mars-valleys-study-2273648 | ndtv.com | 2020-08-03 22:18:26 | Glaciers Could Have Sculpted Mars Valleys: Study | en | 4daf3542-1374-4d8c-ba57-051509bb5a2c |
| 4 | SCIENCE | https://www.thesun.ie/tech/5742187/perseid-meteor-shower-tonight-time-uk-see/ | thesun.ie | 2020-08-12 19:54:36 | Perseid meteor shower 2020: What time and how to see the huge bright FIREBALLS over UK again tonight | en | aab4e21c-5af8-48a6-8737-762a0d22e7b7 |
| 5 | SCIENCE | https://interestingengineering.com/nasa-releases-in-depth-map-of-beirut-explosion-damage | interestingengineering.com | 2020-08-08 11:05:45 | NASA Releases In-Depth Map of Beirut Explosion Damage | en | 20e4043d-8082-4a73-a03c-88aa3bff8b31 |
| 6 | SCIENCE | https://www.thequint.com/tech-and-auto/spacex-nasa-demo-2-rocket-launch-set-for-saturday-how-to-watch | thequint.com | 2020-05-28 09:09:46 | SpaceX, NASA Demo-2 Rocket Launch Set for Saturday: How to Watch | en | 59294720-c3c7-4ea4-95a6-c8099e11823a |
| 7 | SCIENCE | https://www.thespacereview.com/article/4003/1 | thespacereview.com | 2020-08-10 22:48:23 | Orbital space tourism set for rebirth in 2021 | en | 3ceac204-7a26-4937-ad30-c2553a6f6321 |
| 8 | SCIENCE | https://www.businessinsider.com/greenland-melting-ice-sheet-past-tipping-point-2020-8 | businessinsider.com | 2020-08-16 00:28:54 | Greenland's melting ice sheet has 'passed the point of no return' | en | 367440f7-0220-4b14-9fa3-4fdea0e0c5d1 |
| 9 | SCIENCE | https://www.thehindubusinessline.com/news/science/nasa-invites-engineering-students-to-help-harvest-water-on-mars-moon/article32352915.ece | thehindubusinessline.com | 2020-08-14 07:43:25 | NASA invites engineering students to help harvest water on Mars, Moon | en | 71dc9393-b48a-475a-9438-ccec6fb7b4fa |


Note: The Pinecone free tier allows only one index. If you have existing indexes, you need to delete them before you can create a new one.

When creating the index, we specify the similarity metric and the embedding vector dimension.

Read the documentation on how to [create an index here](https://docs.pinecone.io/reference/create_index/).
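To make the choice of metric concrete, the sketch below computes cosine similarity, dot product, and Euclidean distance by hand for a few toy 2-dimensional vectors. The vectors are made up for illustration, and the functions are plain Python, not Pinecone's implementation.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product: sensitive to both direction and magnitude."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: depends only on the angle between the vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance: smaller means closer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a, b, c = [1.0, 0.0], [2.0, 0.0], [0.0, 1.0]

# a and b point the same way but differ in length
print(cosine(a, b), dot(a, b), euclidean(a, b))  # 1.0 2.0 1.0
# a and c are orthogonal
print(cosine(a, c), dot(a, c))  # 0.0 0.0
```

Cosine ignores vector length, which is one reason it is a common choice for sentence embeddings.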


In [10]:
from sentence_transformers import SentenceTransformer

# We will use this model to embed our data.
# On first use the weights are downloaded into cache_dir; later runs load them from there.
model = SentenceTransformer("all-MiniLM-L6-v2", cache_folder=cache_dir)



Delete the index if it already exists

In [11]:
pinecone_index_name = "news"

if pinecone_index_name in pinecone.list_indexes():
    pinecone.delete_index(pinecone_index_name)

Create the index.

When creating the index, we specify the index name (required), the embedding vector dimension (required), and the similarity metric (cosine is the default).


In [12]:
# only create index if it doesn't exist
if pinecone_index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=pinecone_index_name,
        dimension=model.get_sentence_embedding_dimension(),
        metric="cosine",
    )
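The delete-then-create steps above can be folded into one helper. The sketch below assumes a client object exposing `list_indexes`, `delete_index`, and `create_index` with the same names and arguments as the calls used in this notebook; `FakeClient` is a hypothetical in-memory stand-in so the example runs without a Pinecone account.

```python
class FakeClient:
    """Hypothetical in-memory stand-in for the three index-management calls."""

    def __init__(self):
        self.indexes = {}

    def list_indexes(self):
        return list(self.indexes)

    def delete_index(self, name):
        del self.indexes[name]

    def create_index(self, name, dimension, metric="cosine"):
        self.indexes[name] = {"dimension": dimension, "metric": metric}


def recreate_index(client, name, dimension, metric="cosine"):
    """Drop the index if it exists, then create it fresh."""
    if name in client.list_indexes():
        client.delete_index(name)
    client.create_index(name=name, dimension=dimension, metric=metric)


client = FakeClient()
client.create_index("news", dimension=768)      # a stale index with the wrong dimension
recreate_index(client, "news", dimension=384)   # replaced by a fresh one
print(client.indexes)  # {'news': {'dimension': 384, 'metric': 'cosine'}}
```

Recreating the index rather than reusing it matters when the embedding model changes, because an index keeps the dimension it was created with.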

Now connect to the index

In [13]:
pinecone_index = pinecone.Index(pinecone_index_name)

Once the index has been created, we can upsert vectors into it. `Upsert` means insert-or-update: a record with a new ID is inserted, and a record whose ID already exists is overwritten.

Refer to this [documentation page](https://docs.pinecone.io/docs/python-client#indexupsert) for example code and vectors.
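The insert-or-overwrite behaviour can be illustrated with a toy in-memory index. `ToyIndex` is a hypothetical class written for this example and only mimics the `(id, values, metadata)` tuple format used in the next cell; it is not the Pinecone client.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, List[float], Dict[str, str]]


class ToyIndex:
    """Hypothetical in-memory index keyed by record ID."""

    def __init__(self) -> None:
        self.store: Dict[str, Tuple[List[float], Dict[str, str]]] = {}

    def upsert(self, vectors: List[Record]) -> int:
        for record_id, values, metadata in vectors:
            # Same ID: overwrite. New ID: insert.
            self.store[record_id] = (values, metadata)
        return len(vectors)

    def __len__(self) -> int:
        return len(self.store)


index = ToyIndex()
index.upsert([
    ("0", [0.1, 0.2], {"title": "first"}),
    ("1", [0.3, 0.4], {"title": "second"}),
])
# Upserting an existing ID replaces the record instead of adding a new one
index.upsert([("1", [0.9, 0.9], {"title": "second, revised"})])
print(len(index))                   # 2
print(index.store["1"][1]["title"])  # second, revised
```

This is why rerunning an upsert cell with the same IDs does not inflate the vector count.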


In [20]:
from tqdm.auto import tqdm

batch_size = 1000

for i in tqdm(range(0, len(pdf["title"]), batch_size)):
    try:
        # find end of batch (this must be Python's built-in min, not a shadowed Spark min)
        i_end = min(i + batch_size, len(pdf["title"]))
        # take the titles as a plain list; a Series slice keeps its original labels,
        # which can break positional indexing in batches after the first
        titles = pdf["title"][i:i_end].tolist()
        # create IDs batch
        ids = [str(x) for x in range(i, i_end)]
        # create metadata batch
        metadata = [{"title": title} for title in titles]
        # create embeddings
        embedding_title_batch = model.encode(titles).tolist()
        # create records list of (id, values, metadata) tuples for upsert
        records = list(zip(ids, embedding_title_batch, metadata))
        # upsert to Pinecone
        pinecone_index.upsert(vectors=records)

    except Exception as e:
        print(f"Error processing batch: {e}")

# check number of records in the index
pinecone_index.describe_index_stats()



{'dimension': 384,
 'index_fullness': 0.01,
 'namespaces': {'': {'vector_count': 1000}},
 'total_vector_count': 1000}

Once the vectors are upserted, we can query the index directly.

Note that each match in the result also carries its similarity score.

In [21]:
query = "fish"

# create the query vector
user_query = model.encode(query).tolist()

# submit the query to the Pinecone index
pinecone_answer = pinecone_index.query(vector=user_query, top_k=3, include_metadata=True)

for result in pinecone_answer["matches"]:
    score_rounded = round(result['score'], 2)
    print(f"{score_rounded}, {result['metadata']['title']}")
    print("-" * 120)


0.46, Cause Of Massive Fish Kill In Shinnecock Canal Not Clear - 27 East
------------------------------------------------------------------------------------------------------------------------
0.39, 'Secret' life of sharks: Study reveals their surprising social networks
------------------------------------------------------------------------------------------------------------------------
0.3, Oh No, Earthworm Jim
------------------------------------------------------------------------------------------------------------------------
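Conceptually, the query above scores the stored vectors against the query vector and returns the `top_k` best matches with their scores and metadata. A brute-force sketch of that idea follows, using made-up 2-dimensional vectors; the `query` function here is a hypothetical illustration that only mimics the `matches` / `score` / `metadata` keys read in the loop above, and it says nothing about how Pinecone searches internally.

```python
import heapq
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def query(store: Dict[str, Tuple[List[float], dict]], vector: List[float], top_k: int) -> dict:
    """Score every stored vector and keep the top_k highest cosine scores."""
    scored = (
        {"id": record_id, "score": cosine(vector, values), "metadata": metadata}
        for record_id, (values, metadata) in store.items()
    )
    return {"matches": heapq.nlargest(top_k, scored, key=lambda m: m["score"])}


# Toy vectors; real embeddings from all-MiniLM-L6-v2 have 384 dimensions
store = {
    "0": ([1.0, 0.0], {"title": "fish"}),
    "1": ([0.8, 0.6], {"title": "sharks"}),
    "2": ([0.0, 1.0], {"title": "rockets"}),
}

answer = query(store, [1.0, 0.0], top_k=2)
for match in answer["matches"]:
    print(round(match["score"], 2), match["metadata"]["title"])
```

Matches come back sorted from most to least similar, which is why the scores in the output above decrease from top to bottom.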


## Method 2: Process with Spark and write to Pinecone with Spark

Now that we have seen how to `upsert` with Pinecone, you may be curious whether we can use the Spark DataFrame Writer (just like with Weaviate) to write the entire DataFrame out in a single operation. The answer is yes, and, as with Weaviate, you will need a Spark connector to do it.

We first need to write a mapping function to map the tokenizer and embedding model onto batches of rows within the Spark DataFrame. We will be using a type of [pandas UDFs](https://www.databricks.com/blog/2020/05/20/new-pandas-udfs-and-python-type-hints-in-the-upcoming-release-of-apache-spark-3-0.html), called scalar iterator UDFs. 

> The function takes and outputs an iterator of pandas.Series.

> The length of the whole output must be the same length of the whole input. Therefore, it can prefetch the data from the input iterator as long as the lengths of entire input and output are the same. The given function should take a single column as input.

> It is also useful when the UDF execution requires expensive initialization of some state. 

We load the model once per partition of data, not once per [batch](https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html#setting-arrow-batch-size), which is faster.

For more documentation, refer to the [Databricks guide on pandas UDFs](https://docs.databricks.com/udf/pandas.html).
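The iterator pattern can be shown without Spark. In the sketch below, a plain Python generator pays an expensive setup cost once and then reuses the result for every batch it is handed; `load_model` is a hypothetical stand-in for loading a real embedding model, and its "embedding" is just the text length.

```python
from typing import Callable, Iterator, List

load_calls = {"count": 0}


def load_model() -> Callable[[str], List[float]]:
    """Hypothetical stand-in for an expensive model load."""
    load_calls["count"] += 1
    return lambda text: [float(len(text))]


def embed_batches(batches: Iterator[List[str]]) -> Iterator[List[List[float]]]:
    # Runs once per call, which corresponds to once per partition
    model = load_model()
    for batch in batches:
        # One output batch per input batch, each of the same length
        yield [model(text) for text in batch]


partition = iter([["a", "bb"], ["ccc"]])
outputs = list(embed_batches(partition))
print(outputs)              # [[[1.0], [2.0]], [[3.0]]]
print(load_calls["count"])  # 1
```

With a per-batch function, the load would have run twice here; the saving grows with the number of batches in a partition.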


In [16]:
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.functions import pandas_udf
from sentence_transformers import SentenceTransformer
from typing import Iterator

transformer_type = "sentence-transformers/all-MiniLM-L6-v2"

@pandas_udf("array<float>")
def create_embeddings_with_transformers(
    sentences: Iterator[pd.Series],
) -> Iterator[pd.Series]:
    # Load the model once per partition, then reuse it for every batch
    model = SentenceTransformer(transformer_type)
    for batch in sentences:
        yield pd.Series(model.encode(batch).tolist())

embedding_spark_df = (
    df.limit(1000)
    .withColumn("values", create_embeddings_with_transformers("title"))
    .withColumn("namespace", F.lit(None))  # Pinecone free-tier does not support namespace
    .withColumn("sparse_values", F.lit(None))  # required by Pinecone v2.0.1 release
    # Metadata is passed as a JSON string
    .withColumn("metadata", F.to_json(F.struct(F.col("topic").alias("TOPIC"))))
    # We select these columns because they are expected by the Spark-Pinecone connector
    .select("id", "values", "sparse_values", "namespace", "metadata")
)
display(embedding_spark_df)

DataFrame[id: string, values: array<float>, sparse_values: void, namespace: void, metadata: string]
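For reference, a single row with this schema can be pictured as a plain Python dictionary. The sketch below builds one by hand; `to_connector_row` is a hypothetical helper, the three-element vector is made up, and the column names are simply those selected above.

```python
import json
import uuid
from typing import List, Optional


def to_connector_row(embedding: List[float], topic: str, namespace: Optional[str] = None) -> dict:
    """Build one row with the five columns selected above."""
    return {
        "id": str(uuid.uuid4()),                  # unique string ID, like uuid() in Spark
        "values": [float(x) for x in embedding],  # dense embedding
        "sparse_values": None,                    # unused here
        "namespace": namespace,                   # left empty in this notebook
        "metadata": json.dumps({"TOPIC": topic}), # metadata serialized as a JSON string
    }


row = to_connector_row([0.1, 0.2, 0.3], "SCIENCE")
print(sorted(row))
print(row["metadata"])  # {"TOPIC": "SCIENCE"}
```

Note that `metadata` is a string rather than a nested structure, which matches the `metadata: string` type in the schema output above.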

Repeat the same step as in Method 1 above to delete and recreate the index. Again, we need to delete the index because Pinecone free tier only allows one index.

Note: This could take ~3 minutes. 


In [17]:
pinecone_index_name = "news"

if pinecone_index_name in pinecone.list_indexes():
    pinecone.delete_index(pinecone_index_name)

# load the model so we can look up its embedding dimension
model = SentenceTransformer(transformer_type)
# only create index if it doesn't exist
if pinecone_index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=pinecone_index_name,
        dimension=model.get_sentence_embedding_dimension(),
        metric="cosine",
    )

# now connect to the index
pinecone_index = pinecone.Index(pinecone_index_name)


Instead of writing in batches, you can now use the Spark DataFrame Writer to write the data out to Pinecone directly.

**IMPORTANT!!** You need to attach the Spark-Pinecone connector `s3://pinecone-jars/0.2.1/spark-pinecone-uberjar.jar` to the cluster you are using. Otherwise, the following command will fail. Refer to this [documentation](https://docs.pinecone.io/docs/databricks#setting-up-a-spark-cluster) and the [release notes](https://github.com/pinecone-io/spark-pinecone/releases/tag/v0.2.1) if you need more information.


In [None]:
(
    embedding_spark_df.write.option("pinecone.apiKey", pinecone_api_key)
    .option("pinecone.environment", pinecone_env)
    .option("pinecone.projectName", pinecone.whoami().projectname)
    .option("pinecone.indexName", pinecone_index_name)
    .format("io.pinecone.spark.pinecone.Pinecone")
    .mode("append")
    .save()
)