**Medical Semantic Search Engine** Inspired Development Environment

In [None]:
# pinecone.init() used below was removed in pinecone-client 3.0, so stay on 2.x
!pip install -U cohere "pinecone-client<3" datasets

In [None]:
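import os
import cohere
# read the API key from the environment instead of hard-coding it in the notebook
# (assumes the key is stored in a COHERE_API_KEY environment variable)
co = cohere.Client(os.environ["COHERE_API_KEY"])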

In [None]:
from datasets import load_dataset
# load the first 1K rows of the TREC dataset
trec = load_dataset('trec', split='train[:1000]')
embeds = co.embed(
    texts=trec['text'],
    model='small',
    truncate='LEFT'
).embeddings

In [None]:
trec

In [None]:
import numpy as np
shape = np.array(embeds).shape
print(shape)

In [None]:
import os
import pinecone
# initialize connection to pinecone (get API key at app.pinecone.io)
# (assumes the key is stored in a PINECONE_API_KEY environment variable)
pinecone.init(
    api_key=os.environ['PINECONE_API_KEY'],
    environment='asia-southeast1-gcp'
)
index_name = 'cohere-pinecone-trec'

# if the index does not exist, we create it
if index_name not in pinecone.list_indexes():
  pinecone.create_index(
      index_name,
      dimension=shape[1],
      metric='cosine'
  )

# connect to index
index = pinecone.Index(index_name)

We can now populate the index with our embeddings. Pinecone expects a list of tuples in the format (id, vector, metadata), where metadata is an optional dictionary in which we can store any extra information. For this example, we store the original text behind each embedding.
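As a minimal illustration of that format, the sketch below builds two records from made-up three-dimensional vectors and slices them into batches. The `toy_` names, the vectors, and the texts are placeholders for illustration, not values from the dataset.

```python
# toy stand-ins for real embeddings and their source texts (hypothetical values)
toy_embeds = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
toy_texts = ["first example question", "second example question"]

# ids are strings, metadata is a plain dictionary
toy_ids = [str(i) for i in range(len(toy_embeds))]
toy_meta = [{'text': t} for t in toy_texts]

# one (id, vector, metadata) tuple per record
records = list(zip(toy_ids, toy_embeds, toy_meta))
print(records[0])

# slice the records into fixed-size batches, as one would before upserting
toy_batch_size = 1
batches = [
    records[i:i + toy_batch_size]
    for i in range(0, len(records), toy_batch_size)
]
print(len(batches))
```

Each element of `batches` has the shape that would be passed as `vectors=` in an upsert call.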

In [None]:
batch_size = 128
ids = [str(i) for i in range(shape[0])]

# create a list of metadata dictionaries
meta = [{'text': text} for text in trec['text']]

# create list of (id, vector, metadata) tuples to be upserted
to_upsert = list(zip(ids, embeds, meta))

for i in range(0, shape[0], batch_size):
  i_end = min(i+batch_size, shape[0])
  index.upsert(vectors=to_upsert[i:i_end])

# let's view the index statistics
print(index.describe_index_stats())

*Semantic Search* Now that we have indexed our vectors, we can run a few search queries. To search, we first embed the query using Cohere, then use the returned vector to query Pinecone.

In [None]:
query = "Why was there a long-term economic downturn in the early 20th century?"
# query = "What was the cause of the major recession in the early 20th century?"
# query = "What caused the 1929 Great Depression?"
# create the query embedding
xq = co.embed(
    texts=[query],
    model='small',
    truncate='LEFT'
).embeddings

print(np.array(xq).shape)

# query, returning the top 5 most similar results
# xq holds a single embedding, so pass that one vector explicitly
res = index.query(vector=xq[0], top_k=5, include_metadata=True)

In [None]:
for match in res['matches']:
  print(f"{match['score']:.2f}: {match['metadata']['text']}")
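The scores printed above are similarities under the index's `cosine` metric. To make that ranking concrete, here is a brute-force sketch of cosine top-k search in NumPy. The function `cosine_top_k` and the toy vectors are hypothetical names for illustration, and this is not a description of how Pinecone performs the search internally.

```python
import numpy as np

def cosine_top_k(query_vec, vectors, k=5):
    """Return (row index, score) pairs for the k rows most similar to query_vec."""
    q = np.asarray(query_vec, dtype=float)
    m = np.asarray(vectors, dtype=float)
    # cosine similarity: dot product divided by the product of the vector norms
    scores = (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))
    # sort by descending score and keep the first k
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]

# toy two-dimensional "embeddings" (made-up values)
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
top = cosine_top_k([1.0, 0.0], toy_vectors, k=2)
for row, score in top:
    print(f"{score:.2f}: row {row}")
```

The vector pointing in the same direction as the query ranks first, followed by the one at a 45 degree angle. This exhaustive comparison is only practical for small collections.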