# Workshop 9

Today, we'll explore Pinecone, a vector database. The activities are based on [two tutorials](https://docs.pinecone.io/examples/notebooks) from the Pinecone developers.

A vector database works by storing each item with a vector. When you want to query the database, you provide a vector and the database returns items that have similar vectors to your query. This is a key ingredient for efficient RAG systems.
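
To make the idea concrete, here is a minimal in-memory sketch of similarity search. It is not how Pinecone is implemented (a production vector database uses approximate nearest-neighbour indexes rather than scanning every item). The `items` dictionary, the 3-dimensional vectors, and the `top_k_similar` helper are all made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_similar(items, query_vector, k=2):
    """Brute-force search: score every stored item, return the k best."""
    scored = [
        (cosine_similarity(vec, query_vector), name)
        for name, vec in items.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:k]

# Hypothetical toy "database": item name -> vector
items = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

results = top_k_similar(items, query_vector=[1.0, 0.0, 0.0], k=2)
for score, name in results:
    print(f"{score:.2f}: {name}")
```

The query vector points in roughly the same direction as "cat" and "kitten", so those two items are returned ahead of "car".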

# Pre-Workshop Preparation

If you haven't already, set up accounts with Pinecone and OpenAI:

Pinecone - https://app.pinecone.io

Once you create an account they will ask you a few questions to get it set up. Choose Python as the language. The rest you can answer however you like.

OpenAI API - https://platform.openai.com/overview

Note, you will need to add a credit/debit card to your account in order to pay for some of the services we use. The expenses should be very small (assuming you just do the activities in this lab).

In [17]:
# api key from app.pinecone.io
pinecone_api_key = 'INSERT_YOUR_PINECONE_KEY_HERE'

# api key from platform.openai.com
openai_api_key = 'INSERT_YOUR_OPENAI_KEY_HERE'

Now, let's install the libraries we need:

In [18]:
!pip install -qU \
  openai==0.27.7 \
  pinecone==6.0.2 \
  pinecone-datasets==1.0.2 \
  sentence-transformers==3.4.1 \
  pinecone-notebooks==0.1.1 \
  pyarrow \
  hf_xet \
  tqdm


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m25.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


Note, some Windows users have also found that they needed to install `pyarrow-11.0.0`.

# In Workshop Activities

## Data Download

In this notebook we will use a pre-processed dataset from Pinecone Datasets.

If you are curious about what pre-processing they did, see [this notebook](https://github.com/pinecone-io/examples/blob/master/learn/search/semantic-search/semantic-search.ipynb).

In [19]:
from pinecone_datasets import load_dataset

dataset = load_dataset('quora_all-MiniLM-L6-bm25')

# we drop the original metadata column and use the blob column as metadata instead
dataset.documents.drop(['metadata'], axis=1, inplace=True)
dataset.documents.rename(columns={'blob': 'metadata'}, inplace=True)

# we will use 80K rows of the dataset between rows 240K -> 320K
dataset.documents.drop(dataset.documents.index[320_000:], inplace=True)
dataset.documents.drop(dataset.documents.index[:240_000], inplace=True)

# Print out a sample from the dataset to show what we are working with
dataset.head()

Loading documents parquet files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [02:02<00:00, 12.26s/it]


Unnamed: 0,id,values,sparse_values,metadata
240000,515997,"[-0.00531694, 0.06937869, -0.0092854, 0.003286...","{'indices': [845, 1657, 13677, 20780, 27058, 2...","{'text': ' Why is a ""law of sciences"" importan..."
240001,515998,"[-0.09243751, 0.065432355, -0.06946959, 0.0669...","{'indices': [2110, 6324, 9754, 13677, 15207, 2...",{'text': ' Is it possible to format a BitLocke...
240002,515999,"[-0.021924071, 0.032280188, -0.020190848, 0.07...","{'indices': [2110, 4949, 23579, 23758, 27058, ...",{'text': ' Can formatting a hard drive stress ...
240003,516000,"[-0.120020054, 0.024080949, 0.10693012, -0.018...","{'indices': [22014, 24734, 24773, 25791, 25991...",{'text': ' Are the new Samsung Galaxy J7 and J...
240004,516001,"[-0.095293395, -0.048446465, -0.017618902, -0....","{'indices': [307, 2110, 5785, 12969, 12971, 13...",{'text': ' I just watched an add for Indonesia...


In [20]:
print(len(dataset))

80000


## Creating an Index

Now the data is ready, we can set up our index to store it.

We begin by initializing our connection to Pinecone. This is where your API key is needed.

In [22]:
import os
from pinecone import Pinecone

# configure client
pc = Pinecone(api_key=pinecone_api_key)

Now we set up our index specification. This allows us to define the cloud provider and region where we want to deploy our index. You can find a list of all available providers and regions [here](https://docs.pinecone.io/docs/projects).

In [23]:
from pinecone import ServerlessSpec

cloud = os.environ.get('PINECONE_CLOUD') or 'aws'
region = os.environ.get('PINECONE_REGION') or 'us-east-1'

spec = ServerlessSpec(cloud=cloud, region=region)

Now we create a new index called `semantic-search-fast`. It's important that we align the index `dimension` and `metric` parameters with those required by the `MiniLM-L6` model.

In [24]:
import time

index_name = 'semantic-search-fast'

existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]

# check if index already exists (it shouldn't if this is first time)
if index_name not in existing_indexes:
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=384,  # dimensionality of minilm
        metric='dotproduct',
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'metric': 'dotproduct',
 'namespaces': {},
 'total_vector_count': 0,
 'vector_type': 'dense'}
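
One reason the metric matters: dot product and cosine similarity only agree when the vectors have unit length. The sketch below uses made-up 2-dimensional vectors (not real embeddings) to show the two scores matching after normalization and differing before it. The helper names are illustrative.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def cosine(a, b):
    """Cosine similarity: dot product divided by both lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Made-up vectors for illustration only
a = [3.0, 4.0]
b = [4.0, 3.0]

raw_dot = dot(a, b)                         # depends on vector length
unit_dot = dot(normalize(a), normalize(b))  # length removed first
print(raw_dot, unit_dot, cosine(a, b))
```

The `all-MiniLM-L6-v2` model loaded later in this workshop ends with a `Normalize()` layer, so query vectors from that model have unit length.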

Upsert the data to put it in the database (this can take several minutes; the run shown below took about six and a half):

In [25]:
from tqdm.auto import tqdm

for batch in tqdm(dataset.iter_documents(batch_size=500), total=160):
    index.upsert(batch)

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 160/160 [06:33<00:00,  2.46s/it]
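
The `total=160` passed to `tqdm` comes from dividing the 80,000 rows by the batch size of 500. Below is a small sketch of that arithmetic and of generic batching. The `batched` helper is hypothetical: it only illustrates the idea of iterating in fixed-size chunks and is not how `iter_documents` is implemented.

```python
import math

def batched(rows, batch_size):
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

num_rows = 80_000
batch_size = 500
num_batches = math.ceil(num_rows / batch_size)
print(num_batches)

# Same idea on a tiny list so the batches are easy to inspect
small_batches = list(batched(list(range(7)), batch_size=3))
print(small_batches)
```

If you change the batch size or the number of rows, update `total` accordingly so the progress bar stays accurate.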


## Making Queries

Now that our index is populated we can begin making queries. We are performing a semantic search for *similar questions*, so we should embed and search with another question.

Note that we use the same model that was used to embed the dataset. That's critical: otherwise the query vector and the stored vectors would not be meaningfully comparable.

In [26]:
from sentence_transformers import SentenceTransformer
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device=device)
model

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

Now let's query.

In [27]:
query = "which city has the highest population in the world?"

# create the query vector and convert it to a plain list for Pinecone
xq = model.encode(query).tolist()

# now query
xc = index.query(vector=xq, top_k=5, include_metadata=True)
xc

{'matches': [{'id': '69331',
              'metadata': {'text': " What's the world's largest city?"},
              'score': 0.785789192,
              'values': []},
             {'id': '69332',
              'metadata': {'text': ' What is the biggest city?'},
              'score': 0.727474,
              'values': []},
             {'id': '84749',
              'metadata': {'text': " What are the world's most advanced "
                                   'cities?'},
              'score': 0.709189594,
              'values': []},
             {'id': '109231',
              'metadata': {'text': ' Where is the most beautiful city in the '
                                   'world?'},
              'score': 0.696054876,
              'values': []},
             {'id': '109230',
              'metadata': {'text': ' What is the greatest, most beautiful city '
                                   'in the world?'},
              'score': 0.657444537,
              'values': []}],
 'namespace

In the returned response `xc` we can see the questions most relevant to our query. There are no exact matches, but the returned questions ask about similar topics. We can reformat this response to be a little easier to read:

In [28]:
for result in xc['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['text']}")

0.79:  What's the world's largest city?
0.73:  What is the biggest city?
0.71:  What are the world's most advanced cities?
0.7:  Where is the most beautiful city in the world?
0.66:  What is the greatest, most beautiful city in the world?


These are good results. Let's modify the wording of the query to see if we still surface similar results.

In [29]:
query = "which metropolis has the highest number of people?"

# create the query vector
xq = model.encode(query).tolist()

# now query
xc = index.query(vector=xq, top_k=5, include_metadata=True)
for result in xc['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['text']}")

0.64:  What is the biggest city?
0.6:  What is the most dangerous city in USA?
0.59:  What's the world's largest city?
0.59:  What is the most dangerous city in USA? Why?
0.58:  What are the world's most advanced cities?


Here the query uses different terms from those in the returned documents. We replaced **"city"** with **"metropolis"** and **"highest population"** with **"highest number of people"**.

Despite these different terms and the *lack* of term overlap between the query and the returned documents, we still get relevant results. This is the power of *semantic search*.
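
To see how little word overlap there is, we can compare the query with the top result from the output above. This is a rough sketch: the `tokens` and `jaccard` helpers are made up for illustration, and splitting on letters is a much cruder tokenizer than the one a real keyword search engine would use.

```python
import re

def tokens(text):
    """Lowercase word set, ignoring punctuation."""
    return set(re.findall(r"[a-z']+", text.lower()))

def jaccard(a, b):
    """Shared words divided by total distinct words."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

query = "which metropolis has the highest number of people?"
top_result = "What is the biggest city?"

shared = tokens(query) & tokens(top_result)
print(shared, round(jaccard(query, top_result), 2))
```

The only shared word is "the", so a purely keyword-based match would have little to go on here.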

## Task 1

Try changing the model and querying again. You can find alternative models [here](https://sbert.net/docs/pretrained_models.html). Note that you will need to choose one with the same dimensionality (384). Clicking on the "info" symbol next to the model names will tell you information including their dimensionality.

Find a model that gives similar results and a model that gives different results.

In [30]:
# TODO

# Solution

query = "which metropolis has the highest number of people?"

# Similar
model2 = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2', device=device)
xq = model2.encode(query).tolist()
xc = index.query(vector=xq, top_k=5, include_metadata=True)
print("Similar")
for result in xc['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['text']}")
print()

# Different
model3 = SentenceTransformer('sentence-transformers/paraphrase-MiniLM-L3-v2', device=device)
xq = model3.encode(query).tolist()
xc = index.query(vector=xq, top_k=5, include_metadata=True)
print("Different")
for result in xc['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['text']}")

Similar
0.47:  What is the biggest city?
0.47:  How do you measure the pollution rate in a city with a population of 6 million people?
0.46:  Which city in the world has the lowest crime rate and why?
0.46:  Which city in India has a large Parsi population?
0.46:  What are the world's most advanced cities?



Different
1.46:  How would people compare the various metropolitan cities in India on all the possible parameters?
1.38:  Are Zillow, Redfin, and Trulia competitors? Who is better-positioned?
1.36:  What are the world's most advanced cities?
1.36:  Is there a way to leverage Google Maps or other tools to determine highest volume commute destinations for a neighborhood?
1.34:  Does London have lower violent crime rates than New York?
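
One simple way to judge "similar" versus "different" is to count how many of the top-5 results each alternative model shares with the original `all-MiniLM-L6-v2` results. The lists below are copied from the outputs above. The `overlap` helper is only an illustration, and other comparisons (rank order, score gaps) are equally reasonable.

```python
baseline = [
    "What is the biggest city?",
    "What is the most dangerous city in USA?",
    "What's the world's largest city?",
    "What is the most dangerous city in USA? Why?",
    "What are the world's most advanced cities?",
]
l12_results = [
    "What is the biggest city?",
    "How do you measure the pollution rate in a city with a population of 6 million people?",
    "Which city in the world has the lowest crime rate and why?",
    "Which city in India has a large Parsi population?",
    "What are the world's most advanced cities?",
]
l3_results = [
    "How would people compare the various metropolitan cities in India on all the possible parameters?",
    "Are Zillow, Redfin, and Trulia competitors? Who is better-positioned?",
    "What are the world's most advanced cities?",
    "Is there a way to leverage Google Maps or other tools to determine highest volume commute destinations for a neighborhood?",
    "Does London have lower violent crime rates than New York?",
]

def overlap(a, b):
    """Number of results that appear in both lists."""
    return len(set(a) & set(b))

print("all-MiniLM-L12-v2 shared results:", overlap(baseline, l12_results))
print("paraphrase-MiniLM-L3-v2 shared results:", overlap(baseline, l3_results))
```

The scores above 1.0 in the "Different" output also suggest that at least one of the vectors involved is not unit length, since a dot product between two unit vectors cannot exceed 1.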


# Retrieval Enhanced Generative Question Answering

Next, we will see how these queries can be used with an LLM to generate better outputs.

We will again use data that has already been prepared (for details, see [this notebook](https://github.com/pinecone-io/examples/blob/master/learn/generation/openai/gen-qa-openai.ipynb)).

In [31]:
from pinecone_datasets import load_dataset

dataset = load_dataset('youtube-transcripts-text-embedding-ada-002')

# we drop the original metadata column and use the blob column as metadata instead
dataset.documents.drop(['metadata'], axis=1, inplace=True)
dataset.documents.rename(columns={'blob': 'metadata'}, inplace=True)

# Print a sample of the data
dataset.head()

Loading documents parquet files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:39<00:00, 39.35s/it]


Unnamed: 0,id,values,sparse_values,metadata
0,35Pdoyi6ZoQ-t0.0,"[-0.010402066633105278, -0.018359748646616936,...",,"{'channel_id': 'UCv83tO5cePwHMt1952IVVHw', 'en..."
1,35Pdoyi6ZoQ-t18.48,"[-0.011849376372992992, 0.0007984379190020263,...",,"{'channel_id': 'UCv83tO5cePwHMt1952IVVHw', 'en..."
2,35Pdoyi6ZoQ-t32.36,"[-0.014534404501318932, -0.0003158661129418760...",,"{'channel_id': 'UCv83tO5cePwHMt1952IVVHw', 'en..."
3,35Pdoyi6ZoQ-t51.519999999999996,"[-0.011597747914493084, -0.007550035137683153,...",,"{'channel_id': 'UCv83tO5cePwHMt1952IVVHw', 'en..."
4,35Pdoyi6ZoQ-t67.28,"[-0.015879768878221512, 0.0030445053707808256,...",,"{'channel_id': 'UCv83tO5cePwHMt1952IVVHw', 'en..."


Again, we will set up a Pinecone index:

In [32]:
index_name = 'gen-qa-openai-fast'
# check if index already exists (it shouldn't if this is first time)
if index_name not in pc.list_indexes().names():
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=1536,  # dimensionality of text-embedding-ada-002
        metric='cosine',
        spec=spec
    )
# connect to index
index = pc.Index(index_name)
# view index stats
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {},
 'total_vector_count': 0,
 'vector_type': 'dense'}

As in the previous section, we'll insert the data into the database (this can take 5-10 minutes):

In [33]:
for batch in dataset.iter_documents(batch_size=100):
    index.upsert(batch)

Now we've added all of the YouTube transcript chunks to the index. With that we can move on to retrieval and then answer generation.

## Retrieval

To search through our documents we first need to create a query vector `xq`. Using `xq` we will retrieve the most relevant chunks from the YouTube transcripts. To create that query vector we must use the same `text-embedding-ada-002` embedding model from OpenAI that was used to embed the dataset. For this, you need an [OpenAI API key](https://platform.openai.com/).

In [34]:
import openai

openai.api_key = openai_api_key

embed_model = "text-embedding-ada-002"

In [35]:
query = (
    "Which training method should I use for sentence transformers when " +
    "I only have pairs of related sentences?"
)

res = openai.Embedding.create(
    input=[query],
    engine=embed_model
)

# extract the query embedding from the OpenAI response
xq = res['data'][0]['embedding']

# retrieve the most relevant contexts from Pinecone
res = index.query(vector=xq, top_k=2, include_metadata=True)

res

{'matches': [{'id': 'pNvujJ1XyeQ-t418.88',
              'metadata': {'channel_id': 'UCv83tO5cePwHMt1952IVVHw',
                           'end': 568.0,
                           'published': '2021-11-24 16:24:24 UTC',
                           'start': 418.0,
                           'text': 'pairs of related sentences you can go '
                                   'ahead and actually try training or '
                                   'fine-tuning using NLI with multiple '
                                   "negative ranking loss. If you don't have "
                                   'that fine. Another option is that you have '
                                   'a semantic textual similarity data set or '
                                   'STS and what this is is you have so you '
                                   'have sentence A here, sentence B here and '
                                   'then you have a score from from 0 to 1 '
                                   'tha

We write some functions to handle the retrieval and completion steps:

In [39]:
limit = 3750

import time

def retrieve(query):
    res = openai.Embedding.create(
        input=[query],
        engine=embed_model
    )

    # extract the query embedding from the OpenAI response
    xq = res['data'][0]['embedding']

    # get relevant contexts, retrying while the index returns fewer than 3
    contexts = []
    time_waited = 0
    while (len(contexts) < 3 and time_waited < 60 * 12):
        res = index.query(vector=xq, top_k=3, include_metadata=True)
        # replace (rather than extend) so retries don't add duplicates
        contexts = [
            x['metadata']['text'] for x in res['matches']
        ]
        print(f"Retrieved {len(contexts)} contexts, sleeping for 15 seconds...")
        time.sleep(15)
        time_waited += 15

    if time_waited >= 60 * 12:
        print("Timed out waiting for contexts to be retrieved.")
        contexts = ["No contexts retrieved. Try to answer the question yourself!"]


    # build our prompt with the retrieved contexts included
    prompt_start = (
        "Answer the question based on the context below.\n\n"+
        "Context:\n"
    )
    prompt_end = (
        f"\n\nQuestion: {query}\nAnswer:"
    )
    # append contexts until hitting limit
    separator = "\n\n---\n\n"
    selected = []
    for context in contexts:
        if len(separator.join(selected + [context])) >= limit:
            break
        selected.append(context)
    # always build a prompt, even when only one context is available
    prompt = prompt_start + separator.join(selected) + prompt_end
    return prompt


def complete(prompt):
    # instructions
    sys_prompt = "You are a helpful assistant that always answers questions."
    # query the chat model
    res = openai.ChatCompletion.create(
        model='gpt-4.1',
        messages=[
            {"role": "system", "content": sys_prompt},
            {"role": "user", "content": prompt}
        ],
        temperature=0
    )
    return res['choices'][0]['message']['content'].strip()
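
The prompt-assembly step inside `retrieve` can be tried offline, without any API calls. The sketch below is a simplified stand-in, not the workshop's exact function: `build_prompt`, the toy contexts, and the tiny `limit` are all made up so the character budget is easy to follow.

```python
def build_prompt(query, contexts, limit):
    """Join contexts with a separator, stopping before the character limit."""
    separator = "\n\n---\n\n"
    selected = []
    for context in contexts:
        candidate = separator.join(selected + [context])
        if len(candidate) > limit:
            break  # adding this context would exceed the budget
        selected.append(context)
    return (
        "Answer the question based on the context below.\n\n"
        "Context:\n"
        + separator.join(selected)
        + f"\n\nQuestion: {query}\nAnswer:"
    )

# Made-up contexts: 10, 20 and 30 characters long
toy_contexts = ["a" * 10, "b" * 20, "c" * 30]
prompt = build_prompt("What is X?", toy_contexts, limit=40)
print(prompt)
```

With a limit of 40 characters, the first two contexts fit (10 + 7 + 20 = 37 characters including the separator) and the third is left out.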

In [40]:
# first we retrieve relevant items from Pinecone
query_with_contexts = retrieve(query)
query_with_contexts

Retrieved 3 contexts, sleeping for 15 seconds...


"Answer the question based on the context below.\n\nContext:\npairs of related sentences you can go ahead and actually try training or fine-tuning using NLI with multiple negative ranking loss. If you don't have that fine. Another option is that you have a semantic textual similarity data set or STS and what this is is you have so you have sentence A here, sentence B here and then you have a score from from 0 to 1 that tells you the similarity between those two scores and you would train this using something like cosine similarity loss. Now if that's not an option and your focus or use case is on building a sentence transformer for another language where there is no current sentence transformer you can use multilingual parallel data. So what I mean by that is so parallel data just means translation pairs so if you have for example a English sentence and then you have another language here so it can it can be anything I'm just going to put XX and that XX is your target language you can 

In [41]:
# then we complete the context-infused query
complete(query_with_contexts)

"If you only have pairs of related sentences (i.e., positive pairs), you should use the Natural Language Inference (NLI) training method with multiple negatives ranking loss. This approach allows you to train or fine-tune a sentence transformer using just the related (entailment) pairs, even if you don't have contradictory or neutral pairs. This method is effective and commonly used when only positive sentence pairs are available."

And we get a good answer straight away, recommending NLI training with _multiple negatives ranking loss_.



## Task 2

Try adjusting the number of contexts down to 1, to see the impact on the retrieved context and the generated answer.

In [42]:
# TODO

# Solution
def retrieve_1_context(query):
    res = openai.Embedding.create(
        input=[query],
        engine=embed_model
    )
    xq = res['data'][0]['embedding']
    contexts = []
    time_waited = 0
    while (len(contexts) < 1 and time_waited < 60 * 12): # change the 3 to 1
        res = index.query(vector=xq, top_k=1, include_metadata=True) # using top_k=1
        contexts = [  # replace rather than extend, so retries don't add duplicates
            x['metadata']['text'] for x in res['matches']
        ]
        print(f"Retrieved {len(contexts)} contexts, sleeping for 15 seconds...")
        time.sleep(15)
        time_waited += 15

    if time_waited >= 60 * 12:
        print("Timed out waiting for contexts to be retrieved.")
        contexts = ["No contexts retrieved. Try to answer the question yourself!"]

    prompt_start = (
        "Answer the question based on the context below.\n\n"+
        "Context:\n"
    )
    prompt_end = (
        f"\n\nQuestion: {query}\nAnswer:"
    )
    joined_context = "\n\n---\n\n".join(contexts)
    if len(joined_context) > limit:
        joined_context = joined_context[:limit]  # truncate
    prompt = prompt_start + joined_context + prompt_end
    return prompt

query_with_contexts = retrieve_1_context(query)
print(query_with_contexts)
complete(query_with_contexts)

Retrieved 3 contexts, sleeping for 15 seconds...
Answer the question based on the context below.

Context:


Question: Which training method should I use for sentence transformers when I only have pairs of related sentences?
Answer:


'If you only have pairs of related sentences (for example, sentence pairs that are paraphrases or semantically similar), the best training method for sentence transformers is **contrastive learning** using a **pairwise loss function**. The most common and effective loss functions in this scenario are:\n\n- **Cosine Similarity Loss** (also called CosineEmbeddingLoss)\n- **Contrastive Loss**\n- **Triplet Loss** (if you can generate negative pairs)\n\nFor sentence-transformers, the **Cosine Similarity Loss** is often used when you have pairs of sentences and a label indicating whether they are similar (1) or not (0). If you only have positive pairs (related sentences), you can use **MultipleNegativesRankingLoss**, which automatically treats other sentences in the batch as negatives.\n\n**In summary:**  \nUse **Cosine Similarity Loss** or **MultipleNegativesRankingLoss** with your sentence pairs. These are specifically designed for training sentence transformers with pairs of related sente

## Pack up

Once you're done with the workshop, delete the indices to save resources:

In [43]:
pc.delete_index('gen-qa-openai-fast')
pc.delete_index('semantic-search-fast')