[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/docs/semantic-search.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/docs/semantic-search.ipynb)

# Semantic Search

In this walkthrough we will see how to use Pinecone for semantic search. To begin, we install the prerequisite libraries:

In [None]:
!pip install -qU \
  "pinecone-client[grpc]"==2.2.1 \
  pinecone-datasets=='0.5.0rc11' \
  sentence-transformers==2.2.2


---

🚨 _Note: the above `pip install` is formatted for Jupyter notebooks. If running elsewhere you may need to drop the `!`._

---

## Data Download

In this notebook we skip the data preparation steps, which can be very time-consuming, and jump straight in with a prebuilt dataset from *Pinecone Datasets*. If you'd rather see how it's all done, please refer to [this notebook](https://colab.research.google.com/github/pinecone-io/examples/blob/master/search/semantic-search/semantic-search.ipynb).

Let's go ahead and download the dataset.

In [None]:
from pinecone_datasets import load_dataset

dataset = load_dataset('quora_all-MiniLM-L6-bm25')

In [None]:
dataset.head()

Unnamed: 0,id,values,sparse_values,metadata,blob
0,1,"[0.06814987, -0.039664183, -0.06096721, 0.0074...","{'indices': [7096, 8508, 13677, 23041, 24734, ...",,{'text': ' What is the step by step guide to i...
1,2,"[0.08983771, -0.03493085, -0.057357617, 0.0222...","{'indices': [7096, 8508, 13677, 24734, 26026, ...",,{'text': ' What is the step by step guide to i...
2,3,"[-0.046798065, 0.1551149, -0.03920019, 0.04878...","{'indices': [6065, 13677, 17109, 20780, 24734,...",,{'text': ' What is the story of Kohinoor (Koh-...
3,4,"[-0.077349104, 0.14786911, -0.0128817065, -0.0...","{'indices': [2408, 6065, 7582, 12225, 17109, 2...",,{'text': ' What would happen if the Indian gov...
4,5,"[-0.028324936, 0.037209604, -0.00040033547, 0....","{'indices': [5388, 12812, 18181, 19960, 20780,...",,{'text': ' How can I increase the speed of my ...


In [None]:
import random

# pick a random row; randrange excludes the upper bound, so the index is always valid
n = random.randrange(len(dataset))

print(len(dataset.documents['blob'][n]['text'].split(' ')), dataset.documents['blob'][n])
print(len(dataset.documents['sparse_values'][n]['indices']), dataset.documents['sparse_values'][n])
print(len(dataset.documents['values'][n]), dataset.documents['values'][n])

9 {'text': " What role does honour play in today's society?"}
11 {'indices': array([ 4738,  8490, 23463, 27722, 39832, 42257, 43002, 54542, 54613,
       58596, 62793]), 'values': array([0.41805061, 0.41805061, 0.41805061, 0.41805061, 0.41805061,
       0.41805061, 0.41805061, 0.41805061, 0.41805061, 0.41805061,
       0.41805061])}
384 [ 2.81952862e-02  1.27759382e-01 -3.32345814e-02 -9.01821628e-02
 -3.86698470e-02  5.37595190e-02  3.99006046e-02 -1.20273463e-01
 -1.75523981e-02  8.50154832e-03  3.06390808e-03 -8.30231886e-03
 -5.17720310e-03  5.12583666e-02  1.07031213e-02  2.11292524e-02
  1.95677914e-02  3.32711749e-02 -7.65894353e-02 -1.06330577e-03
 -9.78886113e-02  9.91917029e-03  1.08835930e-02  2.70390161e-03
 -9.92706046e-02 -1.62062105e-02  2.37394441e-02 -1.63184442e-02
 -7.05988929e-02 -5.56001030e-02 -1.03309518e-03 -1.02186061e-01
  1.81651101e-01  3.29538062e-02 -1.28898814e-01  9.24681574e-02
  1.20489776e-01 -2.71093864e-02 -2.72359010e-02 -6.28088936e-02
 -4.5161217

In [None]:
len(dataset)

522931

In [None]:
dir(dataset)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_async_upsert',
 '_config',
 '_create_index',
 '_dataset_path',
 '_documents',
 '_fs',
 '_is_datatype_exists',
 '_load_metadata',
 '_metadata',
 '_pinecone_client',
 '_queries',
 '_read_pandas_dataframe',
 '_safe_read_from_path',
 'documents',
 'from_catalog',
 'from_pandas',
 'from_path',
 'head',
 'iter_documents',
 'iter_queries',
 'metadata',
 'queries',
 'to_catalog',
 'to_path',
 'to_pinecone_index',
 'to_pinecone_index_async']

In [None]:
dataset.documents.drop(['metadata'], axis=1, inplace=True)

In [None]:
dataset.head(3)

Unnamed: 0,id,values,sparse_values,blob
0,1,"[0.06814987, -0.039664183, -0.06096721, 0.0074...","{'indices': [7096, 8508, 13677, 23041, 24734, ...",{'text': ' What is the step by step guide to i...
1,2,"[0.08983771, -0.03493085, -0.057357617, 0.0222...","{'indices': [7096, 8508, 13677, 24734, 26026, ...",{'text': ' What is the step by step guide to i...
2,3,"[-0.046798065, 0.1551149, -0.03920019, 0.04878...","{'indices': [6065, 13677, 17109, 20780, 24734,...",{'text': ' What is the story of Kohinoor (Koh-...


In [None]:
# from pinecone_datasets import load_dataset

# dataset = load_dataset('quora_all-MiniLM-L6-bm25')
# # we drop the empty metadata column (sparse_values is kept)
# dataset.documents.drop(['metadata'], axis=1, inplace=True)
dataset.documents.rename(columns={'blob': 'metadata'}, inplace=True)
# we will use 80K rows of the dataset between rows 240K -> 320K
dataset.documents.drop(dataset.documents.index[320_000:], inplace=True)
dataset.documents.drop(dataset.documents.index[:240_000], inplace=True)
dataset.head()

Unnamed: 0,id,values,sparse_values,metadata
240000,515997,"[-0.00531694, 0.06937869, -0.0092854, 0.003286...","{'indices': [845, 1657, 13677, 20780, 27058, 2...","{'text': ' Why is a ""law of sciences"" importan..."
240001,515998,"[-0.09243751, 0.065432355, -0.06946959, 0.0669...","{'indices': [2110, 6324, 9754, 13677, 15207, 2...",{'text': ' Is it possible to format a BitLocke...
240002,515999,"[-0.021924071, 0.032280188, -0.020190848, 0.07...","{'indices': [2110, 4949, 23579, 23758, 27058, ...",{'text': ' Can formatting a hard drive stress ...
240003,516000,"[-0.120020054, 0.024080949, 0.10693012, -0.018...","{'indices': [22014, 24734, 24773, 25791, 25991...",{'text': ' Are the new Samsung Galaxy J7 and J...
240004,516001,"[-0.095293395, -0.048446465, -0.017618902, -0....","{'indices': [307, 2110, 5785, 12969, 12971, 13...",{'text': ' I just watched an add for Indonesia...


{'indices': array([ 7096,  8508, 13677, 23041, 24734, 26026, 28331, 29963, 39832,
        42257, 43002, 43099, 57136, 62793]),
 'values': array([0.37557059, 0.37557059, 0.37557059, 0.37557059, 0.37557059,
        0.37557059, 0.37557059, 0.37557059, 0.37557059, 0.37557059,
        0.37557059, 0.37557059, 0.37557059, 0.37557059])}

In [None]:
print(len(dataset))

80000

## Creating an Index

Now the data is ready, we can set up our index to store it.

We begin by initializing our connection to Pinecone. To do this we need a [free API key](https://app.pinecone.io).

In [None]:
import os

# replace these placeholders with your own credentials; never share real keys
PINECONE_API_KEY = 'YOUR_API_KEY'
PINECONE_ENVIRONMENT = 'YOUR_ENVIRONMENT'

os.environ['PINECONE_API_KEY'] = PINECONE_API_KEY
os.environ['PINECONE_ENVIRONMENT'] = PINECONE_ENVIRONMENT


In [None]:
import pinecone

# get api key from app.pinecone.io
PINECONE_API_KEY = os.environ.get('PINECONE_API_KEY') or 'PINECONE_API_KEY'
# find your environment next to the api key in pinecone console
PINECONE_ENV = os.environ.get('PINECONE_ENVIRONMENT') or 'PINECONE_ENVIRONMENT'

pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_ENV
)

Now we create a new index. The original example names it `semantic-search-fast`; here we use `hbqa`. The index `dimension` and `metric` parameters must match what the `MiniLM-L6` model requires: 384 dimensions (the embedding length printed earlier) and cosine similarity.
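To make the `metric` choice concrete, here is a minimal pure-Python sketch of cosine similarity, the score that a `cosine` index ranks results by. The helper name `cosine_similarity` and the toy 3-dimensional vectors are illustrative assumptions and not part of the Pinecone client.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        # mirrors the requirement that every vector matches the index dimension
        raise ValueError(f"dimension mismatch: {len(a)} != {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# toy vectors (hypothetical values, far smaller than real embeddings)
print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal
```

Vectors that point the same way score 1.0 and orthogonal vectors score 0.0, which is why the scores in the query results later in this notebook fall within that range.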

# 5 Different Types of Icon for 5 Different Objects in Python
- Package
- Module
- Class
- Function
- NoneType

You can see these icons by typing `.` after the object name. They appear in the context-sensitive help (autocomplete) popup.

In [None]:
help(pinecone.Vector) #Help on class Vector in module pinecone.core.client.model.vector:

In [None]:
help(pinecone.core) #Help on package pinecone.core in pinecone:

In [None]:
help(pinecone.delete_index) #Help on function delete_index in module pinecone.manage:

In [None]:
help(pinecone.InfoResult) #Help on NoneType object:

In [None]:
help(pinecone.Config) #Help on _CONFIG in module pinecone.config object:

In [None]:
help(pinecone.exceptions) #Help on module pinecone.exceptions in pinecone:

In [None]:
dir(pinecone)

In [None]:
index_name = 'hbqa' #'semantic-search-fast'

# Working with Pinecone
Pinecone is a vector database. The free plan allows only one pod, so you cannot have more than one free index. Check which indexes already exist; if one exists and you want to create a new index for free, you need to delete the existing one first.
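The check-then-remove step can be sketched as a small helper. This is only an illustration: `free_index_slot` and `FakeClient` are hypothetical names, and the fake client stands in for the real `pinecone` module so the logic can be tried without touching an account.

```python
def free_index_slot(client, keep):
    """Delete every index except `keep`; return the names that were deleted."""
    deleted = []
    for name in client.list_indexes():
        if name != keep:
            client.delete_index(name)
            deleted.append(name)
    return deleted


class FakeClient:
    """Hypothetical in-memory stand-in exposing the two calls used above."""

    def __init__(self, names):
        self.names = list(names)

    def list_indexes(self):
        return list(self.names)

    def delete_index(self, name):
        self.names.remove(name)


client = FakeClient(['old-index', 'hbqa'])
print(free_index_slot(client, keep='hbqa'))  # names that were removed
print(client.list_indexes())                 # what remains
```

Because the `pinecone` module exposes `list_indexes()` and `delete_index()` as used in this notebook, it could presumably be passed in place of the fake client. Deleting an index discards its data, so check the list before running this against a real account.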

In [None]:
pinecone.list_indexes()

['hbqa']

In [None]:
import time

# only create index if it doesn't exist
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=len(dataset.documents.iloc[0]['values']),
        metric='cosine'
    )
    # wait a moment for the index to be fully initialized
    time.sleep(1)

# now connect to the index
index = pinecone.GRPCIndex(index_name)
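A fixed `time.sleep(1)` may be too short if index creation is slow. One alternative is to poll until the index reports that it is ready. The sketch below shows a generic polling helper; `wait_until` and `fake_ready` are hypothetical names, and the readiness check is simulated so the example runs on its own.

```python
import time

def wait_until(is_ready, timeout=60.0, interval=1.0):
    """Poll `is_ready()` until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# simulated readiness check that succeeds on the third poll
polls = {'count': 0}

def fake_ready():
    polls['count'] += 1
    return polls['count'] >= 3

print(wait_until(fake_ready, timeout=5.0, interval=0.01))
```

With pinecone-client 2.x, the readiness check is commonly written as `lambda: pinecone.describe_index(index_name).status['ready']`. Treat that attribute access as an assumption to verify against your client version.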

Upsert the data:

In [None]:
vector_dim = 128  # same value used by the batching example in this section
x = [random.random() for _ in range(vector_dim)]
print(x)

[0.7137448866045683, 0.1990146888996609, 0.6686847409759294, 0.25791093351482874, 0.4759451388524568, 0.5885718756419783, 0.20580051812514588, 0.26045081877254384, 0.47881612213518987, 0.5161145653200749, 0.9392906984763799, 0.33541535844284764, 0.6664705238819099, 0.8017660414828739, 0.0064000558399999186, 0.48778896495023916, 0.7369169225793167, 0.17074772939466454, 0.31087542680667857, 0.9466958107119855, 0.5671745627083019, 0.17457145096321725, 0.21907842564982527, 0.8577996046023121, 0.4091542760716603, 0.504033195432012, 0.6363365720513261, 0.024690664522498573, 0.7706584349187058, 0.14258940174891221, 0.2918326737550542, 0.8575469452021239, 0.7154542362019559, 0.5586882990023072, 0.1204519272853184, 0.27810719807787454, 0.8960101872099712, 0.7421318510971117, 0.259642898734916, 0.5045778115513715, 0.8831965747265172, 0.4585411720350854, 0.4241716523698065, 0.9656849778802824, 0.42984315642027227, 0.7928549330116796, 0.7792700977745133, 0.9981590117062161, 0.7104640376810878, 0.4

In [None]:
map(lambda i: (f'id-{i}', [random.random() for _ in range(vector_dim)]), range(10000))

<map at 0x7bf85e1838b0>

In [None]:
import random
import itertools

def chunks(iterable, batch_size=100):
    """A helper function to break an iterable into chunks of size batch_size."""
    it = iter(iterable)
    chunk = tuple(itertools.islice(it, batch_size))
    while chunk:
        yield chunk
        chunk = tuple(itertools.islice(it, batch_size))

vector_dim = 128
vector_count = 10000

# Example generator that generates many (id, vector) pairs
example_data_generator = map(lambda i: (f'id-{i}', [random.random() for _ in range(vector_dim)]), range(vector_count))

# Upsert data with 100 vectors per upsert request
for ids_vectors_chunk in chunks(example_data_generator, batch_size=100):
    # index.upsert(vectors=ids_vectors_chunk)  # Assuming `index` defined elsewhere
    print(ids_vectors_chunk)

Output hidden; open in https://colab.research.google.com to view.

In [None]:
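# peek at the first batch to inspect the record structure
batch = next(iter(dataset.iter_documents(batch_size=100)))
batch[0]['sparse_values']

# each record has the fields:
# id
# values
# sparse_values
#   indices
#   values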


In [None]:
for batch in dataset.iter_documents(batch_size=100):
    for b in batch:
        # inspect the sparse indices of each record
        print(b['sparse_values']['indices'])
    # uncomment to upsert each batch of 100 records
    # index.upsert(batch)

## Making Queries

Now that our index is populated we can begin making queries. We are performing a semantic search for *similar questions*, so we should embed and search with another question. Let's begin.

In [None]:
from sentence_transformers import SentenceTransformer
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
model

Now let's query.

In [None]:
query = "which city has the highest population in the world?"

# create the query vector
xq = model.encode(query).tolist()

# now query
xc = index.query(xq, top_k=5, include_metadata=True)
xc

{'matches': [{'id': '69331',
              'metadata': {'text': " What's the world's largest city?"},
              'score': 0.7856553,
              'sparse_values': {'indices': [], 'values': []},
              'values': []},
             {'id': '69332',
              'metadata': {'text': ' What is the biggest city?'},
              'score': 0.7271396,
              'sparse_values': {'indices': [], 'values': []},
              'values': []},
             {'id': '84749',
              'metadata': {'text': " What are the world's most advanced "
                                   'cities?'},
              'score': 0.7092116,
              'sparse_values': {'indices': [], 'values': []},
              'values': []},
             {'id': '109231',
              'metadata': {'text': ' Where is the most beautiful city in the '
                                   'world?'},
              'score': 0.6960551,
              'sparse_values': {'indices': [], 'values': []},
              'values': []}

In the returned response `xc` we can see the questions most relevant to our query. There are no exact matches, but the returned questions cover similar topics. We can reformat this response to be a little easier to read:

In [None]:
for result in xc['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['text']}")

0.79:  What's the world's largest city?
0.73:  What is the biggest city?
0.71:  What are the world's most advanced cities?
0.7:  Where is the most beautiful city in the world?
0.66:  What is the greatest, most beautiful city in the world?


These are good results. Let's change the wording of the query to see if we still surface similar results.

In [None]:
query = "which metropolis has the highest number of people?"

# create the query vector
xq = model.encode(query).tolist()

# now query
xc = index.query(xq, top_k=5, include_metadata=True)
for result in xc['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['text']}")

0.64:  What is the biggest city?
0.6:  What is the most dangerous city in USA?
0.59:  What's the world's largest city?
0.59:  What is the most dangerous city in USA? Why?
0.58:  What are the world's most advanced cities?


Here the query uses different terms from those in the returned documents. We replaced **"city"** with **"metropolis"** and **"highest population"** with **"highest number of people"**.

Despite these very different terms and the *lack* of term overlap between the query and the returned documents, we still get highly relevant results. This is the power of *semantic search*.
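To put a number on the lack of term overlap, the following sketch computes the Jaccard overlap between the query and two of the returned questions. The helpers `tokens` and `jaccard` are illustrative names, and the simple regex tokenizer is an assumption rather than what any search engine actually uses.

```python
import re

def tokens(text):
    """Lowercase word tokens, ignoring punctuation other than apostrophes."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def jaccard(a, b):
    """Share of distinct tokens that two texts have in common."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

query = "which metropolis has the highest number of people?"
results = [
    "What is the biggest city?",
    "What's the world's largest city?",
]
for r in results:
    print(f"{jaccard(query, r):.2f}: {r}")
```

The only token the query shares with either question is the stop word "the", so a purely keyword-based match would have very little to work with here.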

You can go ahead and ask more questions above. When you're done, delete the index to save resources:

In [None]:
pinecone.delete_index(index_name)

---