# Basic Semantic Search
Language models let computers search by meaning rather than only by matching keywords. This capability is called semantic search.

![Searching an archive using sentence embeddings](https://github.com/cohere-ai/notebooks/raw/main/notebooks/images/basic-semantic-search-overview.png?3)

In this notebook, we'll build a simple semantic search engine. The applications of semantic search go beyond web search: it can power a private search engine for internal documents or records, or features like StackOverflow's "similar questions". We'll go through the following steps:

1. Get the archive of questions
2. [Embed](https://docs.cohere.ai/embed-reference/) the archive
3. Search using an index and nearest neighbor search
4. Visualize the archive based on the embeddings 

In [1]:
# Install Cohere for embeddings, UMAP to reduce embeddings to 2 dimensions,
# Altair for visualization, Annoy for approximate nearest neighbor search,
# and Datasets to load the archive of questions
!pip install cohere umap-learn altair annoy datasets


Get your Cohere API key by [signing up here](https://os.cohere.ai/register). Paste it in the cell below.

In [1]:
import cohere
import numpy as np
import re
import pandas as pd
from tqdm import tqdm
from datasets import load_dataset
import umap
import altair as alt
from sklearn.metrics.pairwise import cosine_similarity
from annoy import AnnoyIndex
pd.set_option('display.max_colwidth', None)

# Paste your API key here. Remember not to share it publicly
api_key = ''

## 1. Get The Archive of Questions
We'll use the [trec](https://www.tensorflow.org/datasets/catalog/trec) dataset, which is made up of questions and their categories.

In [2]:
# Get dataset
dataset = load_dataset("trec", split="train")

# Import into a pandas dataframe, take only the first 1000 rows
df = pd.DataFrame(dataset)[:1000]
# df['text'] = df['text'].str.decode('utf-8')

# Preview the data to ensure it has loaded correctly
df.head(10)

Using custom data configuration default
Reusing dataset trec (/Users/orca/.cache/huggingface/datasets/trec/default/1.1.0/751da1ab101b8d297a3d6e9c79ee9b0173ff94c4497b75677b59b61d5467a9b9)


| | label-coarse | label-fine | text |
|---|---|---|---|
| 0 | 0 | 0 | How did serfdom develop in and then leave Russia ? |
| 1 | 1 | 1 | What films featured the character Popeye Doyle ? |
| 2 | 0 | 0 | How can I find a list of celebrities ' real names ? |
| 3 | 1 | 2 | What fowl grabs the spotlight after the Chinese Year of the Monkey ? |
| 4 | 2 | 3 | What is the full form of .com ? |
| 5 | 3 | 4 | What contemptible scoundrel stole the cork from my lunch ? |
| 6 | 3 | 5 | What team did baseball 's St. Louis Browns become ? |
| 7 | 3 | 6 | What is the oldest profession ? |
| 8 | 0 | 7 | What are liver enzymes ? |
| 9 | 3 | 4 | Name the scar-faced bounty hunter of The Old West . |


## 2. Embed the archive
Let's now embed the text of the questions.
![embedding archive texts](https://github.com/cohere-ai/notebooks/raw/main/notebooks/images/semantic-search-embed-text-archive.png)

In [5]:
#@title Embedding code (Run this cell to execute required code) {display-mode: "form"}

# Create a Cohere client using the API key retrieved from os.cohere.ai
co = cohere.Client(api_key)


# Define a Cohere batch embedding object used to batch embed text
class cohereBatchEmbedder():
    def __init__(self, model, client, batch_size=64):
        self.model = model
        self.client = client
        self.batch_size = batch_size
        self.embeddings = []

    def batch_embed(self, examples):
        self.embeddings = []
        for i in tqdm(range(0,len(examples),self.batch_size)):
            batch = examples[i:i+self.batch_size]
            emb = self.client.embed(texts=batch,
                                    model=self.model,
                                    truncate="LEFT").embeddings
            self.embeddings += emb
        return self.embeddings

embedder = cohereBatchEmbedder('small', co)

In [6]:
# Get the embeddings
embeds = embedder.batch_embed(list(df['text']))

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:19<00:00,  1.25s/it]
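
The progress bar shows 16 iterations because the 1,000 questions are sent in batches of 64, and 1000 / 64 rounds up to 16 requests. The slicing can be sketched on its own, without calling the API. `make_batches` and the placeholder question strings are hypothetical names used only for this illustration.

```python
def make_batches(items, batch_size=64):
    """Split a list into consecutive slices of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# Placeholder texts standing in for the 1,000 questions in df['text']
questions = [f"question {i}" for i in range(1000)]
batches = make_batches(questions)

# 15 full batches of 64 plus one final batch with the remaining 40 items
print(len(batches), len(batches[-1]))
```

A larger batch size means fewer requests, while a smaller one keeps each request light. The value 64 here simply mirrors the default used in the class above.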


In [8]:
# Check the dimensions of the embeddings
embeds = np.array(embeds)
embeds.shape

(1000, 1024)

## 3. Search using an index and nearest neighbor search
Let's now use [Annoy](https://github.com/spotify/annoy) to store the embeddings and retrieve search results. This approach scales well to a large number of texts (other options include [Faiss](https://github.com/facebookresearch/faiss), [ScaNN](https://github.com/google-research/google-research/tree/master/scann), and [PyNNDescent](https://github.com/lmcinnes/pynndescent)).

In [9]:
# Create the search index, pass the size of embedding
search_index = AnnoyIndex(embeds.shape[1], 'angular')
# Add all the vectors to the search index
for i in range(len(embeds)):
    search_index.add_item(i, embeds[i])

search_index.build(10) # 10 trees
search_index.save('test.ann')

True

### Find the neighbors of an example from the dataset
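If we're only interested in measuring the similarities between the questions in the dataset (no outside queries), a simple way is to calculate the similarities between every pair of embeddings we have. Here, we instead query the index we just built, which returns the nearest neighbors of an item already stored in it.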

In [10]:
# Choose an example (we'll retrieve others similar to it)
example_id = 92

# Retrieve nearest neighbors
similar_item_ids = search_index.get_nns_by_item(example_id,10,
                                                include_distances=True)
# Format and print the text and distances
results = pd.DataFrame(data={'texts': df.iloc[similar_item_ids[0]]['text'], 
                             'distance': similar_item_ids[1]}).drop(example_id)

print(f"Question:'{df.iloc[example_id]['text']}'\nNearest neighbors:")
results

Question:'What are bear and bull markets ?'
Nearest neighbors:


| | texts | distance |
|---|---|---|
| 614 | What animals do you find in the stock market ? | 0.896121 |
| 137 | What are equity securities ? | 0.97026 |
| 601 | What is `` the bear of beers '' ? | 0.978348 |
| 307 | What does NASDAQ stand for ? | 0.997819 |
| 683 | What is the rarest coin ? | 1.027727 |
| 112 | What are the world 's four oceans ? | 1.049661 |
| 864 | When did the Dow first reach ? | 1.050362 |
| 547 | Where can stocks be traded on-line ? | 1.053685 |
| 871 | What are the Benelux countries ? | 1.054899 |
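
For comparison, the brute-force approach mentioned at the start of this section (similarity between every pair of embeddings) can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not the notebook's method: `toy_embeds` is a made-up stand-in for `embeds`, and the helper names `cosine_similarity_matrix` and `nearest_by_cosine` are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Return the cosine similarity between every pair of rows."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms          # scale every row to unit length
    return unit @ unit.T            # dot products of unit vectors = cosines

def nearest_by_cosine(vectors, item_id, k=3):
    """Return (indices, similarities) of the k rows most similar to item_id."""
    sims = cosine_similarity_matrix(vectors)[item_id].copy()
    sims[item_id] = -np.inf         # exclude the item itself
    top = np.argsort(-sims)[:k]     # highest similarity first
    return top, sims[top]

# Toy 2-dimensional vectors standing in for the real embeddings
toy_embeds = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])
ids, sims = nearest_by_cosine(toy_embeds, item_id=0, k=2)
print(ids, sims)
```

The full similarity matrix grows with the square of the number of texts, so this is practical for a small archive like this one, while an approximate index is the better fit as the archive grows.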


### Find the neighbors of a user query
We're not limited to searching using existing items. If we get a query, we can embed it and find its nearest neighbors from the dataset.

In [11]:
query = "What is the tallest mountain in the world?"

# Get the query's embedding
query_embed = embedder.batch_embed([query])

# Retrieve the nearest neighbors
similar_item_ids = search_index.get_nns_by_vector(query_embed[0],10,
                                                include_distances=True)
# Format the results
results = pd.DataFrame(data={'texts': df.iloc[similar_item_ids[0]]['text'], 
                             'distance': similar_item_ids[1]})
results

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.78s/it]


| | texts | distance |
|---|---|---|
| 236 | What is the name of the tallest mountain in the world ? | 0.431913 |
| 670 | What is the highest mountain in the world ? | 0.43629 |
| 907 | What mountain range is traversed by the highest railroad in the world ? | 0.715265 |
| 435 | What is the highest peak in Africa ? | 0.717943 |
| 354 | What ocean is the largest in the world ? | 0.762917 |
| 412 | What was the highest mountain on earth before Mount Everest was discovered ? | 0.767649 |
| 109 | Where is the highest point in Japan ? | 0.784319 |
| 114 | What is the largest snake in the world ? | 0.789743 |
| 656 | What 's the tallest building in New York City ? | 0.793982 |
| 901 | What 's the longest river in the world ? | 0.794352 |
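
The `distance` column is easier to read once it is related to cosine similarity. Per the Annoy README, the `'angular'` metric is the Euclidean distance between normalized vectors, `sqrt(2 * (1 - cos))`, so a smaller distance means a higher cosine similarity. A minimal sketch of the conversion, assuming that definition (the helper names `angular_distance` and `angular_to_cosine` are hypothetical, not part of Annoy):

```python
import numpy as np

def angular_distance(u, v):
    """Angular distance as defined in the Annoy README: sqrt(2 * (1 - cos))."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.sqrt(max(0.0, 2.0 * (1.0 - cos)))

def angular_to_cosine(distance):
    """Invert the formula above: cos = 1 - d^2 / 2."""
    d = np.asarray(distance, dtype=float)
    return 1.0 - d ** 2 / 2.0

# Distance of the top result in the table above
top_cosine = angular_to_cosine(0.431913)
print(round(float(top_cosine), 3))

# Orthogonal vectors sit at distance sqrt(2), i.e. cosine similarity 0
orthogonal = angular_distance([1.0, 0.0], [0.0, 1.0])
print(orthogonal, angular_to_cosine(orthogonal))
```

Under this conversion the top result's distance of 0.431913 corresponds to a cosine similarity of roughly 0.907, while distances near 1.0, as in the earlier stock market example, correspond to a cosine similarity of about 0.5.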


## 4. Visualizing the archive
Finally, let's plot out all the questions onto a 2D chart so you're able to visualize the semantic similarities of this dataset!

In [12]:
#@title Plot the archive {display-mode: "form"}

# UMAP reduces the dimensions from 1024 to 2 dimensions that we can plot
reducer = umap.UMAP(n_neighbors=20) 
umap_embeds = reducer.fit_transform(embeds)
# Prepare the data for an interactive visualization
# using Altair
df_explore = pd.DataFrame(data={'text': df['text']})
df_explore['x'] = umap_embeds[:,0]
df_explore['y'] = umap_embeds[:,1]

# Plot
chart = alt.Chart(df_explore).mark_circle(size=60).encode(
    x=alt.X('x',
        scale=alt.Scale(zero=False)
    ),
    y=alt.Y('y',
        scale=alt.Scale(zero=False)
    ),
    tooltip=['text']
).properties(
    width=700,
    height=400
)
chart.interactive()



Hover over the points to read the text. Do you see patterns in the clumped points, such as similar questions or questions asking about similar topics?
