[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/search/semantic-search/semantic-search.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/search/semantic-search/semantic-search.ipynb)

# Semantic Search

[![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/fast-link.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/docs/semantic-search.ipynb)

In this walkthrough we will see how to use Pinecone for semantic search. To begin we must install the required prerequisite libraries:

In [1]:
!pip install -qU \
  "pinecone-client[grpc]"==3.1.0 \
  datasets==2.12.0 \
  sentence-transformers==2.2.2

---

🚨 _Note: the above `pip install` is formatted for Jupyter notebooks. If running elsewhere you may need to drop the `!`._

---

## Data Preprocessing

The dataset preparation process requires a few steps:

1. We load the product dataset from a CSV file using Hugging Face Datasets.

2. The text content of the dataset is embedded into vectors.

3. We reformat into a `(id, vector, metadata)` structure to be added to Pinecone.

We will see how steps `1`, `2`, and `3` are done in this section, but we won't implement `2` and `3` across the whole dataset until we reach the *upsert loop* as we will iteratively perform these two steps.

In either case, this can take some time. If you'd rather skip the data preparation step and get straight to upserts and testing the semantic search functionality, you should refer to the [**fast notebook**](https://colab.research.google.com/github/pinecone-io/examples/blob/master/docs/semantic-search.ipynb).

In [3]:
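from datasets import load_dataset
import pandas as pd

# load the local CSV file; the "csv" loader places all rows in a "train" split
dataset = load_dataset("csv", data_files="/content/walmart1_data.csv")
# keep a handle on that split so its columns can be accessed below
data = dataset["train"]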

Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-df7dfb4bad9031d8/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-df7dfb4bad9031d8/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.



The dataset contains product descriptions, stored in the `description` column.

In [26]:
data['description']

['If you\'re passionate about IT and electronics, like being up to date on technology and don\'t miss even the slightest details, buy Smartwatch GARMIN Vivoactive 3 1,2" GPS Waterproof 5 ATM Glonass Black Stainless steel at an unbeatable price.Colour: BlackBluetooth: yesApprox. range: 7 DaysMaterial: Stainless steelSiliconeCompatible: iPhone, AndroidScreen: 1,2"GPS: yesResolution: 240 x 240 pxGlonass: yesPedometer: yesHeart-rate Monitor: yesImpermeable: 5 atmGarmin Pay: yes',
 'Take 6 tablets daily. NOW USDA Certified Organic Spirulina delivers the natural nutrient profile found in genuine whole foods. Spirulina is a blue-green microalgae that has naturally occurring protein, plus other nutrients such as vitamins, minerals and GLA (gamma-linolenic acid). NOW Certified Organic Spirulina Tablets are pure and contain no excipients, binders, or additives.',
 'The Thermo Whip from ISI offers the ultimate tool to foam various foods. Designed to be used with both hot and cold ingredients with

All we need for this example is the description text itself, so we extract the descriptions into a single `descriptions` list.

In [35]:
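# collect every product description into a single list
descriptions = list(data['description'])

# de-duplicated copy, order preserved
# (the cells below still use `descriptions`, not this list)
unique_descriptions = list(dict.fromkeys(descriptions))

print('\n'.join(descriptions[:10]))
len(descriptions)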

If you're passionate about IT and electronics, like being up to date on technology and don't miss even the slightest details, buy Smartwatch GARMIN Vivoactive 3 1,2" GPS Waterproof 5 ATM Glonass Black Stainless steel at an unbeatable price.Colour: BlackBluetooth: yesApprox. range: 7 DaysMaterial: Stainless steelSiliconeCompatible: iPhone, AndroidScreen: 1,2"GPS: yesResolution: 240 x 240 pxGlonass: yesPedometer: yesHeart-rate Monitor: yesImpermeable: 5 atmGarmin Pay: yes
Take 6 tablets daily. NOW USDA Certified Organic Spirulina delivers the natural nutrient profile found in genuine whole foods. Spirulina is a blue-green microalgae that has naturally occurring protein, plus other nutrients such as vitamins, minerals and GLA (gamma-linolenic acid). NOW Certified Organic Spirulina Tablets are pure and contain no excipients, binders, or additives.
The Thermo Whip from ISI offers the ultimate tool to foam various foods. Designed to be used with both hot and cold ingredients with the added b

30


With our descriptions ready to go we can move on to demoing steps **2** and **3** above.

### Building Embeddings and Upsert Format

To create our embeddings we will use the `MiniLM-L6` sentence transformer model. This is a very efficient semantic similarity embedding model from the `sentence-transformers` library. We initialize it like so:

In [12]:
from sentence_transformers import SentenceTransformer
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
if device != 'cuda':
    print(f"You are using {device}. This is much slower than using "
          "a CUDA-enabled GPU. If on Colab you can change this by "
          "clicking Runtime > Change runtime type > GPU.")

model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
model

You are using cpu. This is much slower than using a CUDA-enabled GPU. If on Colab you can change this by clicking Runtime > Change runtime type > GPU.



SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)

There are *three* interesting bits of information in the above model printout. Those are:

* `max_seq_length` is `256`. That means that the maximum number of tokens (like words) that can be encoded into a single vector embedding is `256`. Anything beyond this *must* be truncated.

* `word_embedding_dimension` is `384`. This number is the dimensionality of vectors output by this model. It is important that we know this number later when initializing our Pinecone vector index.

* `Normalize()`. This final normalization step indicates that all vectors produced by the model are normalized to unit length. That means that models for which we would typically measure similarity using *cosine similarity* can also make use of the *dotproduct* similarity metric. In fact, with normalized vectors *cosine* and *dotproduct* are equivalent.
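
The equivalence in the last point can be checked numerically. The following is a minimal sketch in plain Python using random stand-in vectors rather than real model outputs; the helper names `l2_normalize`, `dot`, and `cosine` are illustrative and not part of any library used in this walkthrough.

```python
import math
import random

def l2_normalize(v):
    """Scale a vector to unit length, as a final Normalize() step would."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

random.seed(0)
# random stand-ins for two 384-dimensional embeddings (not model outputs)
a = l2_normalize([random.gauss(0, 1) for _ in range(384)])
b = l2_normalize([random.gauss(0, 1) for _ in range(384)])

# for unit-length vectors the two metrics agree up to floating-point error
print(math.isclose(dot(a, b), cosine(a, b), abs_tol=1e-9))
```

Because both norms are `1`, the denominator in the cosine formula has no effect, which is why either metric can be chosen for an index holding these vectors.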

Moving on, we can create a sentence embedding using this model like so:

In [13]:
query = 'Give clock'

xq = model.encode(query)
xq.shape

(384,)

Encoding this single sentence leaves us with a `384`-dimensional sentence embedding (matching the `word_embedding_dimension` above).

To prepare this for `upsert` to Pinecone, all we do is this:

In [14]:
_id = '0'
metadata = {'text': query}

vectors = [(_id, xq, metadata)]

Later, when we upsert our data to Pinecone, we will do so in batches, meaning `vectors` will be a list of `(id, embedding, metadata)` tuples.
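
To make the batched structure concrete, here is a minimal sketch of a generator that yields one list of `(id, embedding, metadata)` tuples per batch. The names `batched_records` and `fake_embed` are hypothetical; `fake_embed` is a stand-in for `model.encode` so the example runs without downloading a model.

```python
def batched_records(texts, embed, batch_size=128):
    """Yield lists of (id, embedding, metadata) tuples, one list per batch."""
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # IDs are the positions of the texts in the full list, as strings
        ids = [str(i) for i in range(start, start + len(batch))]
        embeddings = embed(batch)
        metadatas = [{'text': text} for text in batch]
        yield list(zip(ids, embeddings, metadatas))

def fake_embed(batch):
    """Stand-in embedder (assumption): one 1-dimensional vector per text."""
    return [[float(len(text))] for text in batch]

texts = [f"description {i}" for i in range(300)]
batches = list(batched_records(texts, fake_embed))

# 300 texts in batches of 128 give two full batches and one partial batch
print([len(batch) for batch in batches])
```

Each yielded list has the same shape as `vectors` above, so it could be passed to an upsert call one batch at a time.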

## Creating an Index

Now the data is ready, we can set up our index to store it.

We begin by initializing our connection to Pinecone. To do this we need a [free API key](https://app.pinecone.io).

In [19]:
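import os

# read the API key from the environment variable named PINECONE_API_KEY;
# never paste the key itself into the notebook
PINECONE_API_KEY = os.environ.get('PINECONE_API_KEY')

# on Colab, fall back to the interactive authentication helper
if not PINECONE_API_KEY:
    from pinecone_notebooks.colab import Authenticate
    Authenticate()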

In [20]:
import os
from pinecone import Pinecone


# initialize connection to pinecone (get API key at app.pinecone.io)
api_key = os.environ.get('PINECONE_API_KEY') or 'PINECONE_API_KEY'

# configure client
pc = Pinecone(api_key=api_key)

Now we set up our index specification, which defines the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

In [21]:
from pinecone import ServerlessSpec

cloud = os.environ.get('PINECONE_CLOUD') or 'aws'
region = os.environ.get('PINECONE_REGION') or 'us-east-1'

spec = ServerlessSpec(cloud=cloud, region=region)

Now we create a new index called `semantic-search`. It's important that we align the index `dimension` and `metric` parameters with those required by the `MiniLM-L6` model.
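
As a small local safeguard, the dimension requirement can be checked before anything is sent to the index. This is a minimal sketch; `check_dimensions` is a hypothetical helper, not part of the Pinecone client, and the record below is a toy stand-in for real embeddings.

```python
def check_dimensions(records, index_dimension):
    """Raise ValueError if any vector's length differs from the index dimension."""
    for _id, vector, _metadata in records:
        if len(vector) != index_dimension:
            raise ValueError(
                f"record {_id!r} has {len(vector)} dimensions, "
                f"but the index expects {index_dimension}"
            )

# toy record standing in for the output of model.encode(...)
records = [('0', [0.0] * 384, {'text': 'Give clock'})]

# passes silently because the vector length matches the index dimension
check_dimensions(records, index_dimension=384)
```

In the notebook, `index_dimension` would be the same value passed as `dimension` when creating the index.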

In [22]:
index_name = 'semantic-search'

In [23]:
import time

# check if index already exists (it shouldn't if this is first time)
if index_name not in pc.list_indexes().names():
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=model.get_sentence_embedding_dimension(),
        metric='cosine',
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
# view index stats
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

Now we upsert the data in batches of `128`.

_**Note:** On Google Colab with GPU expected runtime is ~7 minutes. If using CPU this will be significantly longer. If you'd like to get this running faster refer to the [fast notebook](https://github.com/pinecone-io/examples/blob/master/search/semantic-search/semantic-search-fast.ipynb)._

In [37]:
from tqdm.auto import tqdm

batch_size = 128
vector_limit = 100000

descriptions = descriptions[:vector_limit]

for i in tqdm(range(0, len(descriptions), batch_size)):
    # find end of batch
    i_end = min(i+batch_size, len(descriptions))
    # create IDs batch
    ids = [str(x) for x in range(i, i_end)]
    # create metadata batch
    metadatas = [{'text': text} for text in descriptions[i:i_end]]
    # create embeddings
    xc = model.encode(descriptions[i:i_end])
    # create records list for upsert
    records = zip(ids, xc, metadatas)
    # upsert to Pinecone
    index.upsert(vectors=records)

# check number of records in the index
index.describe_index_stats()


{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}
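
The stats above still report a `total_vector_count` of `0` immediately after the upsert, even though the queries later in this walkthrough return matches. One plausible explanation, stated here as an assumption rather than something this notebook confirms, is that index statistics can lag briefly behind recent writes. A minimal polling sketch follows; `wait_for_count` is a hypothetical helper that takes any zero-argument callable returning the current count.

```python
import time

def wait_for_count(get_count, expected, timeout=30.0, interval=1.0):
    """Poll get_count() until it reaches `expected` or `timeout` seconds pass.

    Returns the last count observed, which may be below `expected` on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        count = get_count()
        if count >= expected or time.monotonic() >= deadline:
            return count
        time.sleep(interval)

# simulated counts standing in for repeated index stats calls
simulated = iter([0, 0, 30])
final_count = wait_for_count(lambda: next(simulated), expected=30, interval=0.01)
print(final_count)
```

In the notebook, the callable could be something like `lambda: index.describe_index_stats().total_vector_count`, assuming the stats response exposes that field as an attribute.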

## Making Queries

Now that our index is populated we can begin making queries. We are performing a semantic search over product descriptions, so we embed a natural-language query and search with the resulting vector. Let's begin.
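
Before running the query, it may help to see what a top-`k` similarity query computes conceptually. The sketch below is a brute-force version in plain Python over toy 2-dimensional unit vectors. It illustrates the ranking idea only and does not show how Pinecone implements search; the function name `top_k` and the toy records are assumptions made for the example.

```python
def top_k(query_vec, stored, k=5):
    """Score every stored vector by dot product and return the k best matches."""
    scored = [
        {
            'id': _id,
            'score': sum(q * x for q, x in zip(query_vec, vec)),
            'metadata': meta,
        }
        for _id, vec, meta in stored
    ]
    return sorted(scored, key=lambda match: match['score'], reverse=True)[:k]

# toy unit-length vectors standing in for real 384-dimensional embeddings
stored = [
    ('0', [1.0, 0.0], {'text': 'watch'}),
    ('1', [0.0, 1.0], {'text': 'tablet'}),
    ('2', [0.6, 0.8], {'text': 'supplement'}),
]

matches = top_k([0.0, 1.0], stored, k=2)
for match in matches:
    print(f"{round(match['score'], 2)}: {match['metadata']['text']}")
```

The returned list mirrors the shape of the `matches` entries in a query response: an `id`, a `score`, and the stored `metadata`.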

In [38]:
query = "Which is a nutritional tablet?"

# create the query vector
xq = model.encode(query).tolist()

# now query
xc = index.query(vector=xq, top_k=5, include_metadata=True)
xc

{'matches': [{'id': '1',
              'metadata': {'text': 'Take 6 tablets daily. NOW USDA Certified '
                                   'Organic Spirulina delivers the natural '
                                   'nutrient profile found in genuine whole '
                                   'foods. Spirulina is a blue-green '
                                   'microalgae that has naturally occurring '
                                   'protein, plus other nutrients such as '
                                   'vitamins, minerals and GLA '
                                   '(gamma-linolenic acid). NOW Certified '
                                   'Organic Spirulina Tablets are pure and '
                                   'contain no excipients, binders, or '
                                   'additives.'},
              'score': 0.5235309,
              'values': []},
             {'id': '7',
              'metadata': {'text': 'Previous page We are a manufacturer and '
         

In the returned response `xc` we can see the descriptions most relevant to our query. We can reformat this response to be a little easier to read:

In [39]:
for result in xc['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['text']}")

0.52: Take 6 tablets daily. NOW USDA Certified Organic Spirulina delivers the natural nutrient profile found in genuine whole foods. Spirulina is a blue-green microalgae that has naturally occurring protein, plus other nutrients such as vitamins, minerals and GLA (gamma-linolenic acid). NOW Certified Organic Spirulina Tablets are pure and contain no excipients, binders, or additives.
0.25: Previous page We are a manufacturer and seller specializing in the research and development of personal care products to solve the body care problems in public life. We are health managers and logistical guarantors of your body. Every detail of the product is applied to the human body to create a healthy code for the body, and to prevent incorrect personal habits from harming your body and causing chronic pain. Click to discover more! Next page
0.23: Condition: Dry or rough skin in need of moisture, especially on tougher areas such as the elbows, knees, and feet. Solution: Certified Organic Shea Butt

These are good results: the top match is the spirulina tablet description, and it scores well above the other matches. Next, let's change the query and see how the results respond.

In [40]:
query = "?"

# create the query vector
xq = model.encode(query).tolist()

# now query
xc = index.query(vector=xq, top_k=5, include_metadata=True)
for result in xc['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['text']}")

0.07: Expandable Adjust this colander length from 15" to 19.8". Fit most family sinks. It is longer than the ordinary expandable colander. Collapsible Adjust the height from 5" to 2". The colander that is 5 inches deeper makes it less likely to drop down when you use it to hold more fruit. The colander that is 2 inches deep is more shallow, but it can save more space when you want to store it in the cupboard. Fix Wide 10" Capacity 7 quart max Fit Sink Size 12.5" to 17" Previous page Higher Filtration Efficiency The evenly spaced holes at the bottom allow for quick draining and increased airflow. Max 302℉ / 150℃ The temperature is higher than that of boiling water, making it safe to strain hot water. High Bearing Capacity After testing, this colander has been shown to effortlessly accommodate up to 25 lbs. Next page 1 Filtration Fast 2 Heat-Resistant 3 High Bearing Capacity   4 Anti-slip rubber pads Our product features four anti-slip rubber pads for stability over the sink.   Stable De

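Here the query was just `"?"`, which carries almost no meaning to embed. The top score drops to `0.07`, compared with `0.52` for the top match of the previous query, and the returned description is not related to any particular intent. A low score like this is a useful signal that the index holds no good match for the query.

A more informative experiment is to rephrase the first query using different words. Semantic search compares meaning rather than exact terms, so a relevant description can be returned even when it shares few words with the query.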
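
Since scores vary this much between queries (a top score of `0.52` for the first query versus `0.07` for the second), one option is to drop weak matches before showing results. This is a minimal sketch; `filter_matches` is a hypothetical helper, the `min_score` of `0.4` is an arbitrary illustrative threshold rather than a recommended value, and the match entries are abbreviated stand-ins for a real response.

```python
def filter_matches(matches, min_score):
    """Keep only the matches whose similarity score is at least min_score."""
    return [match for match in matches if match['score'] >= min_score]

# abbreviated stand-ins using the rounded scores printed earlier
first_query = [
    {'id': 'a', 'score': 0.52, 'metadata': {'text': 'spirulina tablets'}},
    {'id': 'b', 'score': 0.25, 'metadata': {'text': 'personal care products'}},
    {'id': 'c', 'score': 0.23, 'metadata': {'text': 'shea butter'}},
]
second_query = [
    {'id': 'd', 'score': 0.07, 'metadata': {'text': 'expandable colander'}},
]

strong_first = filter_matches(first_query, min_score=0.4)
strong_second = filter_matches(second_query, min_score=0.4)

print([match['metadata']['text'] for match in strong_first])
print(len(strong_second))
```

A suitable threshold depends on the model and the data, so it is worth inspecting scores for a few known-good and known-bad queries before fixing a value.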

You can go ahead and ask more questions above. When you're done, delete the index to save resources:

In [None]:
pc.delete_index(index_name)

---