# Semantic Search with Cohere and Pinecone

In this notebook we will demonstrate how to perform semantic search for identifying similar or duplicate questions using Cohere and Pinecone.

![Steps in semantic search process](https://raw.githubusercontent.com/pinecone-io/examples/master/integrations/cohere/assets/index_query_pinecone_cohere.png)

## Setup

We first need to set up our environment and retrieve API keys for Cohere and Pinecone. Let's start with the environment: we need NumPy, NLTK for tokenization, and the Cohere and Pinecone clients.

In [1]:
# Install pip packages in the current Jupyter kernel
import sys
!{sys.executable} -m pip install numpy nltk cohere pinecone-client


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.2.2[0m[39;49m -> [0m[32;49m23.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49m/Library/Frameworks/Python.framework/Versions/3.10/bin/python3 -m pip install --upgrade pip[0m


Next, sign up for API keys at [Cohere](https://os.cohere.ai/) and [Pinecone](https://app.pinecone.io). You can enter the keys directly in the cell below.

In [2]:
# Replace the placeholders with your own keys; avoid sharing notebooks that contain real keys
COHERE_KEY = 'YOUR_COHERE_API_KEY'
PINECONE_KEY = 'YOUR_PINECONE_API_KEY'
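
Keys typed directly into a cell are easy to leak when the notebook is shared. As an alternative, here is a minimal sketch that reads them from environment variables. The helper `read_key` and the variable names `COHERE_API_KEY` and `PINECONE_API_KEY` are hypothetical choices for illustration; neither client requires them.

```python
import os

def read_key(name, env=None):
    """Return the value of environment variable `name`; raise if it is unset or empty."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running this notebook")
    return value

# Hypothetical variable names; uncomment once they are exported in your shell.
# COHERE_KEY = read_key("COHERE_API_KEY")
# PINECONE_KEY = read_key("PINECONE_API_KEY")
```

Failing early with a clear message tends to be easier to debug than an authentication error raised later by one of the clients.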

## Create Embeddings

We can create sentence embeddings easily using Cohere. First, we import the Cohere client and initialize our connection using the API key we retrieved earlier.

In [3]:
import cohere

co = cohere.Client(COHERE_KEY)

For this demo we load a local text file, `tensor7.txt`, which contains lecture notes on the tensor product. The cells below read the file, tokenize it, and split the tokens into overlapping chunks that we then embed. The file is small, but the same approach can be scaled to millions or even billions of samples.

In [223]:
filename = "tensor7.txt"
with open(filename) as file:
    lines = [line.strip() for line in file]
txt = " ".join(lines)

print(len(txt))

17452


In [224]:
print(lines[0])

221A Lecture Notes


In [225]:
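# word_tokenize needs NLTK's Punkt tokenizer data; if it is missing, download it first
# with nltk.download('punkt') (or 'punkt_tab' on newer NLTK releases)
from nltk.tokenize import word_tokenize
tokens = word_tokenize(txt)
print(len(tokens))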

5655


In [226]:
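# Approximate length of each line, counting non-whitespace characters only so that
# it is comparable to the token-based character counts used when chunking
line_nums = [len("".join(line.split())) for line in lines]

def getLineNum(c_seen):
    """Return the index of the line that contains the c_seen-th character."""
    for i in range(len(line_nums)):
        if c_seen > line_nums[i]:
            c_seen -= line_nums[i]
        else:
            return i
    # c_seen is past the end of the text: fall back to the last line
    return len(line_nums) - 1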
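
The lookup above walks the list of line lengths on every call. A sketch of an equivalent lookup using cumulative sums and binary search is shown here on toy data; the names `build_line_index` and `line_for_offset` are hypothetical, and clamping out-of-range offsets to the last line is an assumption of this sketch.

```python
from bisect import bisect_left
from itertools import accumulate

def build_line_index(line_lengths):
    """Cumulative character counts: entry i is the total length of lines 0..i."""
    return list(accumulate(line_lengths))

def line_for_offset(cumulative, offset):
    """Return the index of the first line whose cumulative length reaches `offset`.

    Offsets past the end of the text are clamped to the last line.
    """
    i = bisect_left(cumulative, offset)
    return min(i, len(cumulative) - 1)

toy_lengths = [5, 0, 10]                    # toy per-line character counts
cumulative = build_line_index(toy_lengths)  # [5, 5, 15]
print(line_for_offset(cumulative, 3))       # 0
print(line_for_offset(cumulative, 6))       # 2
```

For a file of this size the difference is negligible; the binary search only starts to matter when there are many lines and many chunks.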

In [245]:
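chunks = []
characters_seen = 0             # characters consumed before the current chunk
chunk_size = 250                # tokens per chunk
overlap = chunk_size // 10      # tokens shared by consecutive chunks
stride = chunk_size - overlap
for i in range(0, len(tokens), stride):
    cur_chunk_tokens = tokens[i:i+chunk_size]
    # approximate range of source lines covered by this chunk
    fro = getLineNum(characters_seen)
    to = getLineNum(characters_seen + len("".join(cur_chunk_tokens)))
    # advance by the stride only, so overlapping tokens are not counted twice
    characters_seen += len("".join(cur_chunk_tokens[:stride]))
    chunks.append({
        'text': " ".join(cur_chunk_tokens),
        'loc': {'from': fro, 'to': to}
    })

print(len(chunks))
print(chunks[0])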

26
{'text': '221A Lecture Notes Notes on Tensor Product 1 What is “ Tensor ” ? After discussing the tensor product in the class , I received many questions what it means . I ’ ve also talked to Daniel , and he felt this is a subject he had learned on the way here and there , never in a course or a book . I myself don ’ t remember where and when I learned about it . It may be worth solving this problem once and for all . Apparently , Tensor is a manufacturer of fancy floor lamps . See http : //growinglifestyle.shop.com/cc.class/cc ? pcd=7878065 & ccsyn=13549 . For us , the word “ tensor ” refers to objects that have multiple indices . In comparison , a “ scalar ” does not have an index , and a “ vector ” one index . It appears in many different contexts , but this point is always the same . 2 Direct Sum Before getting into the subject of tensor product , let me first discuss “ direct sum. ” This is a way of getting a new big vector space from two ( or more ) smaller vector spaces in the
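
The windowing logic in the cell above can be isolated into a small function, which makes the effect of the overlap easier to see. This is a sketch on toy data; the name `chunk_tokens` is hypothetical, while the defaults mirror the `chunk_size` of 250 and the overlap of 25 used above.

```python
def chunk_tokens(tokens, chunk_size=250, overlap=25):
    """Split `tokens` into windows of up to `chunk_size` items that share `overlap` items."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), stride)]

toy_tokens = list(range(10))
windows = chunk_tokens(toy_tokens, chunk_size=4, overlap=1)
print(windows)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9], [9]]
```

Each window starts `chunk_size - overlap` tokens after the previous one, so 5,655 tokens with a stride of 225 give the 26 chunks reported above. The last window can be much shorter than the others.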

In [246]:
embeds = co.embed(
    texts=[chunk['text'] for chunk in chunks],
    model='small',
    truncate='LEFT'
).embeddings

In [247]:
shape = len(embeds)
print(shape)

26
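
Each entry of `embeds` is a vector of floats, one per chunk. Semantic search ranks stored vectors by their similarity to a query vector. The toy sketch below illustrates that ranking in plain Python, assuming cosine similarity as the metric (this notebook does not state which metric `cohere-index` was created with). The function names and the two-dimensional vectors are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, vectors):
    """Return the indices of `vectors`, most similar to `query` first."""
    scores = [cosine_similarity(query, v) for v in vectors]
    return sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)

toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(rank_by_similarity([1.0, 0.1], toy_vectors))  # [0, 2, 1]
```

A vector database performs the same kind of ranking, but over an index built to avoid comparing the query against every stored vector.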


In [248]:
import pinecone
pinecone.init(
    PINECONE_KEY,
    environment="northamerica-northeast1-gcp"  # find next to API key in console
)
print('pinecone initialized')

pinecone initialized


In [249]:
index = pinecone.Index('cohere-index')
print('pinecone index initialized')

pinecone index initialized


In [250]:
batch_size = 128

ids = [str(i) for i in range(shape)]
# create list of metadata dictionaries
meta = [
    {
        'text': chunk['text'],
        'source': "",
        'pdf_numpages': 695,
        'line-from': chunk['loc']['from'],
        'line-to': chunk['loc']['to'],
    }
    for chunk in chunks
]

# create list of (id, vector, metadata) tuples to be upserted
to_upsert = list(zip(ids, embeds, meta))

# upsert in batches of at most batch_size vectors
i_start = 0
for i in range(i_start, shape, batch_size):
    print(i)  # start offset of the current batch
    i_end = min(i+batch_size, shape)
    index.upsert(vectors=to_upsert[i:i_end],
                 namespace='tensor')

0
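
Only `0` is printed because all 26 vectors fit into a single batch of 128. The slicing pattern used by the upsert loop can be sketched as a small generator, shown here on toy records. The name `batched` and the toy tuples are hypothetical, and no Pinecone call is made.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# toy (id, vector, metadata) tuples in the same shape as to_upsert
records = [(str(i), [0.0, 0.0], {"text": f"chunk {i}"}) for i in range(5)]
sizes = [len(batch) for batch in batched(records, 2)]
print(sizes)  # [2, 2, 1]
```

Keeping the batching separate from the upsert call makes it straightforward to add retries or rate limiting around each batch later.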


We can then pass these questions to Cohere to create embeddings.

---