# How to use Cohere embeddings and rerank module with MongoDB Atlas Search / Vector Search.
Cohere embeddings are a way of representing text as high-dimensional vectors that capture the semantic meaning and context of the text. Rerank module is a feature of Cohere that allows you to customize the ranking of search results(retrieved from MongoDB Atlas) based on reranker module. Working with Cohere embeddings and rerank module, you can create powerful and flexible vector search applications/Information retrieval systems that can handle natural language queries and also address complex scenarios.

The powerful combination of technologies you can further enhance existing operational data store(ODS) functionality of MongoDB with features such as semantic search, question-answering as in RAG framework, real time threat detection, and plethora of other applications that rely on NLP and Full text search capabilities over Real time and transactions data sources that generate text data a scale.

# What we will do in this tutorial?

### Store Embeddings and Prepare the Index
1. Use the Cohere Embed Jobs to generate vector embeddings for the first time on large datasets in an asynchronous and scheduled manner using the Cohere platform.
2. Add vector embeddings into MongoDB Atlas, which can store and index these vector embeddings alongside your other operational/metadata. 
3. Finally, prepare the indexes for both vector embeddings as well as full-text search on our private dataset.

### Search with vector embeddings
1. Write a simple python function to accept search terms/phrases and pass it through the Cohere embed api again to get a query vector.
2. Take these resultant query vector embedding and perform a vector search query using the `$vectorsearch` operator in the MongoDB Aggregation Pipeline.
3. Pre-filter documents using meta information to narrow the search across your dataset; thereby speeding up the performance of vector search results while retaining accuracy.
4. The retrieved semantically similar documents can be post-filtered(relevancy score) to demonstrate a higher degree of control over the semantic search behavior.

### Search with text and rerank with Cohere
1. Write a simple python function to accept search terms / phrases and prepare a search query using $search operator and MongoDB Aggregation Pipeline.
2. Take these resultant documents perform reranking operation of the retrieved documents to achieve higher accuracy with full text search results using Cohere rerank module.


![MDB_vector_search_Cohere.png](attachment:c6d9e4ab-5157-4d81-9e3f-b2c0807e73e6.png)

In [285]:
pwd

'/Users/ashwin.gangadhar/projects/MongoDB-AI-Resources'

## Install all dependecies

In [3]:
%%bash
pip install cohere pymongo pandas



# Intitialize the packages

In [4]:
import os
import getpass
import json

# Initialize Cohere API Key

In [5]:
try:
    cohere_api_key = os.environ["COHERE_API_KEY"]
except KeyError:
    cohere_api_key = getpass.getpass("Please enter your COHERE API KEY (hit enter): ")

# Initialize MongoDB Connection String

In [6]:
try:
    MONGO_CONN_STR = os.environ["MONGO_CONNECTION_STR"]
except KeyError:
    MONGO_CONN_STR = getpass.getpass("Please enter your MongoDB Atlas Connection String (hit enter): ")

# Load dataset from S3

In [1]:
import pandas as pd
import s3fs
df = pd.read_json("s3://ashwin-partner-bucket/cohere/movies_sample_dataset.jsonl", orient="records", lines=True)
df.to_json("./movies_sample_dataset.jsonl", orient='records', lines=True)
df[:3]

Unnamed: 0,overview,title,release_date,vote_average,vote_count,adult,year,month,day,text
0,"Led by Woody, Andy's toys live happily in his ...",Toy Story,1995-10-30,7.7,5415,False,1995,10,30,"Title: Toy Story Genres: Animation,Comedy,Fam..."
1,When siblings Judy and Peter discover an encha...,Jumanji,1995-12-15,6.9,2413,False,1995,12,15,"Title: Jumanji Genres: Animation,Comedy,Famil..."
2,A family wedding reignites the ancient feud be...,Grumpier Old Men,1995-12-22,6.5,92,False,1995,12,22,"Title: Grumpier Old Men Genres: Animation,Com..."


# Initialize and Schedule the Cohere Embeddings Job to embed the `sample_movies` dataset

In [8]:
import cohere  
co_client = cohere.Client(cohere_api_key)

In [9]:
dataset = co_client.create_dataset(name='movies',
                                   data=open("./movies_sample_dataset.jsonl",'r'),
                                   keep_fields=["overview","title","year"],
                                   dataset_type="embed-input").wait()
dataset.wait()
dataset

uploading file, starting validation...
movies-wz7adk was uploaded
...


In [12]:
embed_job = co_client.create_embed_job(dataset_id=dataset.id, 
    input_type='search_document',
    model='embed-english-v3.0', 
    truncate='END')
embed_job.wait()
output_dataset = co_client.get_dataset(embed_job.output.id)
results = list(output_dataset)
len(results)

In [17]:
output_dataset

cohere.Dataset {
	id: embeded-movies-p4kcg2
	name: embeded-movies
	dataset_type: embed-result
	validation_status: validated
	created_at: 2024-01-23 10:54:50.794112
	updated_at: 2024-01-23 10:54:50.794112
	download_urls: ['https://storage.googleapis.com/cohere-user/dataset-api-temp/8268d35d-90bc-4d70-88d1-3851ec01c1c1/e89b6118-1480-4765-b6a8-0819432a0932/embeded-movies-p4kcg2/001_embeded-movies.avro?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=dataset%40cohere-production.iam.gserviceaccount.com%2F20240123%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240123T105749Z&X-Goog-Expires=259199&X-Goog-Signature=428f0194c4ea7bbcbf42e05dd37f0278b21452f8f797467834fa223c8c215e5ed2726525c4dd1779533b1e3865aa323adb92449ded1ed48e4502ff864542e47c23764ddc861e6834616f1acff52d46f1cdef16478aaa76a5d8f3f1bcc015cff6d3e4e3003c94ee621f61c04482e9ea0624f4ea317d05405221b6344136ef390bae9486bd30e02969577ecc123bc4acaa50ec6c2db7effe60a3ab8788fd66b60df0b5b8568e2a4d0c5eba822a6c14e29473201780b0a373b2b7025cf0b98607

# How to Initialize MongoDB Atlas and Insert data to MongoDB collection

In [23]:
from pymongo import MongoClient
mongo_client = MongoClient(MONGO_CONN_STR)

output_collection = mongo_client["sample_mflix"]["cohere_embed_movies"]
if output_collection.count_documents({})>0:
    output_collection.delete_many({})
e = output_collection.insert_many(results)

# Programatically create Vector Search and Full text search index in MongoDB Atlas for the respective collection used in the tutorial

In [117]:
output_collection.create_search_index({"definition":
        {"mappings":
         {"dynamic": True,
          "fields": {
            "embedding" : {
                "dimensions": 1024,
                "similarity": "cosine",
                "type": "vector"
                }
            }}},
     "name": "default"
    }
)

'default'

# Python function to embed search term/phrase and perform vectors search query using `$vectorsearch` operator

In [32]:
def query_vector_search(q, prefilter = {}, postfilter = {},path="embedding",topK=2):
    ele = co_client.embed(model="embed-english-v3.0",input_type="search_document",texts=[q])
    query_embedding = ele.embeddings[0]
    vs_query = {
                "index": "default",
                "path": path,
                "queryVector": query_embedding,
                "numCandidates": 10,
                "limit": topK,
            }
    if len(prefilter)>0:
        vs_query["filter"] = prefilter
    new_search_query = {"$vectorSearch": vs_query}
    project = {"$project": {"score": {"$meta": "vectorSearchScore"},"_id": 0,"title": 1, "release_date": 1, "overview": 1,"year": 1}}
    if len(postfilter.keys())>0:
        postFilter = {"$match":postfilter}
        res = list(output_collection.aggregate([new_search_query, project, postFilter]))
    else:
        res = list(output_collection.aggregate([new_search_query, project]))
    return res

# Vector search query example

In [33]:
query_vector_search("romantic comedy movies", topK=5)

[{'overview': 'German Comedy',
  'title': 'Honeymoon',
  'year': 1996,
  'score': 0.7810907959938049},
 {'overview': 'Seven close friends reunite for the wedding of two of their friends. Problems arise because the bride and the maid of honor have had a long rivalry over the groom.',
  'title': 'The Romantics',
  'year': 2010,
  'score': 0.7723205089569092},
 {'overview': 'Spoof of romantic comedies which focuses on a man (Campbell), his crush (Hannigan), his parents (Coolidge, Willard), and her father (Griffin).',
  'title': 'Date Movie',
  'year': 2006,
  'score': 0.7711508274078369},
 {'overview': 'A film editor breaks up with his girlfriend, unsure if he is in love.',
  'title': 'Modern Romance',
  'year': 1981,
  'score': 0.7707068920135498},
 {'overview': 'A pair of lovers have no idea that their affair is overseen and orchestrated by an angel sent by Cupid to see that they remain together in this romantic comedy.',
  'title': "Lover's Knot",
  'year': 1996,
  'score': 0.769820392

# Vector search query example with prefilter

In [37]:
# Romantic comedy movies released before 1990
query_vector_search("romantic comedy movies", prefilter={"year":{"$lt": 1990}}, topK=5)

[{'overview': 'A film editor breaks up with his girlfriend, unsure if he is in love.',
  'title': 'Modern Romance',
  'year': 1981,
  'score': 0.7707068920135498},
 {'overview': 'A 1938 romantic comedy film which tells the story of a teenage boy who becomes entangled with three different girls all at the same time.',
  'title': 'Love Finds Andy Hardy',
  'year': 1938,
  'score': 0.7665838003158569},
 {'overview': 'Bickering husband-and-wife stage stars are manipulated into a break-up for publicity purposes.',
  'title': 'Sweethearts',
  'year': 1938,
  'score': 0.7654489278793335},
 {'overview': 'Two couples go to a mutual friends wedding, and end up swapping partners.',
  'title': 'Cousins',
  'year': 1989,
  'score': 0.7594757080078125},
 {'overview': 'A couple struggle to find happiness after a whirlwind courtship.',
  'title': 'Made for Each Other',
  'year': 1939,
  'score': 0.7590867280960083}]

# Vector search query example with prefilter and postfilter to control the semantic search relavancy and behaviour

In [43]:
query_vector_search("romantic comedy movies", prefilter={"year":{"$lt": 1990}}, postfilter={"score": {"$gt":0.76}},topK=5)

[{'overview': 'A film editor breaks up with his girlfriend, unsure if he is in love.',
  'title': 'Modern Romance',
  'year': 1981,
  'score': 0.7707068920135498},
 {'overview': 'A 1938 romantic comedy film which tells the story of a teenage boy who becomes entangled with three different girls all at the same time.',
  'title': 'Love Finds Andy Hardy',
  'year': 1938,
  'score': 0.7665838003158569},
 {'overview': 'Bickering husband-and-wife stage stars are manipulated into a break-up for publicity purposes.',
  'title': 'Sweethearts',
  'year': 1938,
  'score': 0.7654489278793335}]

# Leverage MongoDB full text and Cohere Rerank Module to rank retrieved results

In [48]:
def query_fulltext_search(q,topK=25):
    v = {"$search": {
      "text": {
        "query": q,
        "path":"overview"
      }
    }}
    project = {"$project": {"score": {"$meta": "searchScore"},"_id": 0,"title": 1, "release-date": 1, "overview": 1}}
    docs = list(output_collection.aggregate([v,project, {"$limit":topK}]))
    return docs

In [49]:
# results before re ranking
docs = query_fulltext_search("romantic comedy movies", topK=10)
docs

[{'overview': 'A 1993 Bollywood Comedy starring Govinda and Chunky Pandey. This movie kick started a long line  of Comedy movies for Govinda.',
  'title': 'Aankhen',
  'score': 5.915504455566406},
 {'overview': "Valentin Vaala's crazy, romantic comedy.",
  'title': 'Morsian yllättää',
  'score': 5.859979629516602},
 {'overview': 'Romantic comedy about aspiring writers in NY.',
  'title': 'Broadway Damage',
  'score': 5.725950241088867},
 {'overview': 'Romantic comedy adapted from a Somerset Maugham play.',
  'title': 'Too Many Husbands',
  'score': 5.661208629608154},
 {'overview': 'A romantic comedy about the invention of the vibrator.',
  'title': 'Hysteria',
  'score': 5.597914695739746},
 {'overview': 'A romantic comedy set on Christmas Eve in Warsaw and centered around a series of characters.',
  'title': 'Letters to Santa',
  'score': 5.1916093826293945},
 {'overview': 'A romantic comedy about a neurotic girl who makes a vow of solitude after getting dumped.',
  'title': 'Lovely 

In [50]:
q = "romantic comedy movies"
docs = query_fulltext_search(q)
results = co_client.rerank(query=q, documents=list(map(lambda x:x["overview"], docs)), top_n=5, model='rerank-english-v2.0') # Change top_n to change the number of results returned. If top_n is not passed, all results will be returned.
for idx, r in enumerate(results):
    print(f"Document Rank: {idx + 1}, Document Index: {r.index}")
    print(f"Document Title: {docs[r.index]['title']}")
    print(f"Document: {r.document['text']}")
    print(f"Relevance Score: {r.relevance_score:.2f}")
    print("\n")

Document Rank: 1, Document Index: 22
Document Title: Love Finds Andy Hardy
Document: A 1938 romantic comedy film which tells the story of a teenage boy who becomes entangled with three different girls all at the same time.
Relevance Score: 0.99


Document Rank: 2, Document Index: 12
Document Title: Seventh Heaven
Document: Seventh Heaven or De zevende zemel is a 1993 Dutch romantic comedy film directed by Jean-Paul Lilienfeld.
Relevance Score: 0.99


Document Rank: 3, Document Index: 19
Document Title: Shared Rooms
Document: A new romantic comedy feature film that brings together three interrelated tales of gay men seeking family, love and sex during the holiday season.
Relevance Score: 0.97


Document Rank: 4, Document Index: 3
Document Title: Too Many Husbands
Document: Romantic comedy adapted from a Somerset Maugham play.
Relevance Score: 0.97


Document Rank: 5, Document Index: 20
Document Title: Walking the Streets of Moscow
Document: "I Am Walking Along Moscow" aka "Ya Shagayu Po