# Semantic Search with Cohere Embed Jobs and Pinecone Serverless

In [None]:
# TODO: upgrade to "cohere>5"
! pip install "cohere<5" "pinecone-client>3.2.1"

In [None]:
import os
import json
import time
import numpy as np
import cohere
from pinecone import Pinecone

co = cohere.Client('COHERE_API_KEY')
pc = Pinecone(
    api_key="PINECONE_API_KEY", 
    source_tag="cohere"
)



## Step 1: Upload a dataset

In [None]:
# Upload a dataset for embed jobs
dataset_file_path = "data/embed_jobs_sample_data.jsonl" # Full path - https://raw.githubusercontent.com/cohere-ai/notebooks/main/notebooks/data/embed_jobs_sample_data.jsonl

ds = co.create_dataset(
    name='sample_file',
    # Insert your file path here; .csv and .jsonl files are accepted
    data=open(dataset_file_path, 'rb'),
    dataset_type="embed-input"
)

print(ds.await_validation())
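
The upload above expects a `.jsonl` file with one JSON object per line. A minimal sketch of producing such a file, assuming each record only needs a `text` field (the field that Step 3 reads back). The file name and sample sentences are hypothetical:

```python
import json
import os
import tempfile

def write_embed_input(texts, path):
    """Write one JSON object per line, each holding a single 'text' field."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            f.write(json.dumps({"text": text}) + "\n")
    return path

# Hypothetical sample data, only for illustration
sample_texts = ["First example passage.", "Second example passage."]
sample_path = write_embed_input(
    sample_texts, os.path.join(tempfile.gettempdir(), "my_embed_input.jsonl")
)

# Read the file back to confirm every line parses as JSON
with open(sample_path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
print(records)
```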

## Step 2: Create embeddings via Cohere's Embed Jobs endpoint

In [None]:
# The dataset is uploaded. Create an embed job with input_type 'search_document',
# since these embeddings will be stored in your Pinecone index.
job = co.create_embed_job(dataset_id=ds.id,
                          input_type='search_document',
                          model='embed-english-v3.0',
                          embeddings_types=['float'])

job.wait() # poll the server until the job is completed 

...
...


In [None]:
print(job)

## Step 3: Prepare embeddings for upsert

In [None]:
# Load the output file into an array
output_dataset=co.get_dataset(job.output.id)
data_array = []
for record in output_dataset:
  data_array.append(record)

# Take the output and format it in the shape for upserting into Pinecone's DB
ids = [str(i) for i in range(len(data_array))]
meta = [{'text':str(data_array[i]['text'])} for i in range(len(data_array))]
embeds=[np.float32(data_array[i]['embeddings']['float']) for i in range(len(data_array))]

to_upsert = list(zip(ids, embeds, meta))
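
Before upserting, it can help to confirm that every vector has the dimension the index will be created with (1024 in Step 4). A small sketch using stand-in records shaped like the embed job output read above. `check_dimensions` is a hypothetical helper, not part of either SDK:

```python
def check_dimensions(records, expected_dim):
    """Return the positions of records whose float embedding has the wrong length."""
    bad = []
    for pos, record in enumerate(records):
        vector = record["embeddings"]["float"]
        if len(vector) != expected_dim:
            bad.append(pos)
    return bad

# Stand-in records with tiny 4-dimensional vectors, only for illustration
fake_records = [
    {"text": "a", "embeddings": {"float": [0.1, 0.2, 0.3, 0.4]}},
    {"text": "b", "embeddings": {"float": [0.1, 0.2, 0.3]}},  # one value short
]
mismatched = check_dimensions(fake_records, expected_dim=4)
print(mismatched)
```

With the notebook's own data, the call would be `check_dimensions(data_array, 1024)`, and an empty list would mean every record is safe to upsert.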

## Step 4: Initialize Pinecone vector database

In [None]:
# Initialize your Pinecone Vector DB
from pinecone import ServerlessSpec

index_name = "embed-jobs-serverless-test-example"

# The 'spec' argument tells Pinecone how to deploy your index.
pc.create_index(
    name=index_name,
    dimension=1024,
    metric="cosine",
    spec=ServerlessSpec(cloud='aws', region='us-west-2')
)

# Target your new serverless index.
idx = pc.Index(index_name)
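
The index above is created with `metric="cosine"`, so match scores are cosine similarities between the query vector and each stored vector. A minimal pure-Python sketch of that score on toy vectors (it illustrates the metric only, not how Pinecone computes it internally):

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score close to 1; perpendicular ones score 0
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(round(same_direction, 6), round(orthogonal, 6))
```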

## Step 5: Upsert embeddings into the index

In [None]:
# Upsert your data into the index
batch_size = 128

for i in range(0, len(data_array), batch_size):
    i_end = min(i+batch_size, len(data_array))
    idx.upsert(vectors=to_upsert[i:i_end])

# let's view the index statistics
print(idx.describe_index_stats())

{'dimension': 1024,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 3664}},
 'total_vector_count': 3664}
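
The batching loop above can also be factored into a reusable generator. A sketch, where `batched` is a hypothetical helper name:

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Toy data: 5 items in batches of 2
batches = list(batched(["a", "b", "c", "d", "e"], 2))
print(batches)
```

With the objects from this notebook, the loop would become `for batch in batched(to_upsert, 128): idx.upsert(vectors=batch)`.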


## Step 6: Query the index

In [None]:
# Let's query the database
query = "What did Microsoft announce in Las Vegas?"

# create the query embedding
xq = co.embed(
    texts=[query],
    model='embed-english-v3.0',
    input_type='search_query',
    truncate='END'
).embeddings

print(np.array(xq).shape)

# query, returning the top 20 most similar results
# pass the single query vector by keyword
res = idx.query(vector=xq[0], top_k=20, include_metadata=True)

(1, 1024)


In [None]:
# Look at the initial retrieval results
for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")

0.48: On October 22, 2012, Microsoft announced the release of new features including co-authoring, performance improvements and touch support.
0.45: On May 2, 2019, at F8, the company announced its new vision with the tagline "the future is private". A redesign of the website and mobile app was introduced, dubbed as "FB5". The event also featured plans for improving groups, a dating platform, end-to-end encryption on its platforms, and allowing users on Messenger to communicate directly with WhatsApp and Instagram users.
0.42: On July 13, 2009, Microsoft announced at its Worldwide Partners Conference 2009 in New Orleans that Microsoft Office 2010 reached its "Technical Preview" development milestone and features of Office Web Apps were demonstrated to the public for the first time. Additionally, Microsoft announced that Office Web Apps would be made available to consumers online and free of charge, while Microsoft Software Assurance customers will have the option of running them on pre

## Step 7: Rerank the retrieved results

In [None]:
# Add Cohere Reranking Step
docs =[match['metadata']['text'] for match in res['matches']]

rerank_response = co.rerank(
  model = 'rerank-english-v2.0',
  query = query,
  documents = docs,
  top_n = 3,
)
for response in rerank_response:
  print(f"{response.relevance_score:.2f}: {response.document['text']}")

0.99: Microsoft Office, or simply Office, is the former name of a family of client software, server software, and services developed by Microsoft. It was first announced by Bill Gates on August 1, 1988, at COMDEX in Las Vegas. Initially a marketing term for an office suite (bundled set of productivity applications), the first version of Office contained Microsoft Word, Microsoft Excel, and Microsoft PowerPoint. Over the years, Office applications have grown substantially closer with shared features such as a common spell checker, Object Linking and Embedding data integration and Visual Basic for Applications scripting language. Microsoft also positions Office as a development platform for line-of-business software under the Office Business Applications brand.
0.93: On January 21, 2015, during the "Windows 10: The Next Chapter" press event, Microsoft unveiled Office for Windows 10, Windows Runtime ports of the Android and iOS versions of the Office Mobile suite. Optimized for smartphone
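
Reranking only reorders the retrieved candidates, so it can be useful to trace each reranked result back to its original vector-search position. A sketch with stand-in data, assuming each rerank result carries an `index` into the submitted `documents` list alongside its `relevance_score`. The `FakeRerankResult` type and the sample scores are hypothetical:

```python
from collections import namedtuple

# Stand-in for a rerank result; only the two fields used here are modelled
FakeRerankResult = namedtuple("FakeRerankResult", ["index", "relevance_score"])

def rank_changes(rerank_results):
    """Pair each new rank with the candidate's original retrieval rank (both 0-based)."""
    return [(new_rank, r.index) for new_rank, r in enumerate(rerank_results)]

fake_results = [
    FakeRerankResult(index=7, relevance_score=0.99),
    FakeRerankResult(index=0, relevance_score=0.93),
    FakeRerankResult(index=2, relevance_score=0.41),
]
changes = rank_changes(fake_results)
print(changes)
```

A pair such as `(0, 7)` would mean the document now ranked first was originally the eighth vector-search match.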

## Another example - query and rerank

In [None]:
# Let's query the database
query = "What was the first youtube video about?"

# create the query embedding
xq = co.embed(
    texts=[query],
    model='embed-english-v3.0',
    input_type='search_query',
    truncate='END'
).embeddings

print(np.array(xq).shape)

# query, returning the top 20 most similar results
# pass the single query vector by keyword
res = idx.query(vector=xq[0], top_k=20, include_metadata=True)

# Look at the initial retrieval results
for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")

(1, 1024)
0.66: YouTube began as a venture capital–funded technology startup. Between November 2005 and April 2006, the company raised money from various investors, with Sequoia Capital, $11.5 million, and Artis Capital Management, $8 million, being the largest two. YouTube's early headquarters were situated above a pizzeria and a Japanese restaurant in San Mateo, California. In February 2005, the company activated codice_1. The first video was uploaded April 23, 2005. Titled "Me at the zoo", it shows co-founder Jawed Karim at the San Diego Zoo and can still be viewed on the site. In May, the company launched a public beta and by November, a Nike ad featuring Ronaldinho became the first video to reach one million total views. The site launched officially on December 15, 2005, by which time the site was receiving 8 million views a day. Clips at the time were limited to 100 megabytes, as little as 30 seconds of footage.
0.58: Karim said the inspiration for YouTube first came from the Sup

In [None]:
# Add Cohere Reranking Step
docs = [match['metadata']['text'] for match in res['matches']]

rerank_response = co.rerank(
  model = 'rerank-english-v2.0',
  query = query,
  documents = docs,
  top_n = 3,
)
for response in rerank_response:
  print(f"{response.relevance_score:.2f}: {response.document['text']}")

0.95: YouTube began as a venture capital–funded technology startup. Between November 2005 and April 2006, the company raised money from various investors, with Sequoia Capital, $11.5 million, and Artis Capital Management, $8 million, being the largest two. YouTube's early headquarters were situated above a pizzeria and a Japanese restaurant in San Mateo, California. In February 2005, the company activated codice_1. The first video was uploaded April 23, 2005. Titled "Me at the zoo", it shows co-founder Jawed Karim at the San Diego Zoo and can still be viewed on the site. In May, the company launched a public beta and by November, a Nike ad featuring Ronaldinho became the first video to reach one million total views. The site launched officially on December 15, 2005, by which time the site was receiving 8 million views a day. Clips at the time were limited to 100 megabytes, as little as 30 seconds of footage.
0.92: Karim said the inspiration for YouTube first came from the Super Bowl XX