# Embeddings Retrieval with Claude and MongoDB Atlas vector search

This notebook provides a step-by-step guide for using the Embedding search tool with Claude. We will:

1. Set up the environment and imports
2. Load documents into a remote MongoDB Atlas vector store
3. Build a search tool to query the vector store
4. Test the search tool
5. Create a Claude client with access to the tool
6. Compare Claude's responses with and without access to the tool

## Imports and Configuration 

First we'll import the libraries this notebook needs and load the required environment variables.

In [1]:
import os
import sys
import dotenv
import anthropic

sys.path.append(os.path.abspath(os.path.join(os.getcwd(), os.pardir)))

import claude_retriever
from claude_retriever.searcher.embedders.local import LocalEmbedder
from claude_retriever.constants import DEFAULT_EMBEDDER
from claude_retriever.searcher.vectorstores.mongodb import MongoDBAtlasVectorStore

# Load environment variables
dotenv.load_dotenv()

True

The first step is setting up your datastore. Here, we will make use of the [Kaggle Amazon Products 2020 Dataset](https://www.kaggle.com/datasets/promptcloud/amazon-product-dataset-2020). It contains 10000 products from Amazon, including their product title, description, price, category tags, etc. For the purposes of this notebook, we've pre-processed the data to concatenate the title, description and category tags into a single "document" field and saved it locally as a JSONL with one line for each product.
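The pre-processing step described above can be sketched roughly as follows. This is an illustration, not the script that produced the dataset: the input keys `name`, `about` and `categories` are hypothetical placeholders for the corresponding Kaggle columns, while the `Product Name:` / `About Product:` / `Categories:` labels and the `text` field mirror what the notebook's outputs show.

```python
import json
from typing import Dict, List


def build_document(product: Dict[str, str]) -> str:
    """Join title, description and category tags into one text field.

    The keys "name", "about" and "categories" are hypothetical; adapt
    them to the actual column names of the raw dataset.
    """
    return (
        f"Product Name: {product['name']}\n\n"
        f"About Product: {product['about']}\n\n"
        f"Categories: {product['categories']}"
    )


def to_jsonl(products: List[Dict[str, str]]) -> str:
    """Serialize one JSON object per line, each with a single "text" field."""
    return "\n".join(json.dumps({"text": build_document(p)}) for p in products)
```

Because `json.dumps` escapes the newlines inside each document, every product still occupies exactly one line of the resulting file.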

We now need to transform this raw text dataset into an embedding dataset. In this notebook we will opt for the simplest possible way to do this locally:

1. We will use the [sentence-transformers](https://www.sbert.net/index.html) library, which allows us to use a lightweight model to embed our text data using only a CPU if that is all we have available.
2. We will save the text/embedding pairs on disk as a JSONL file that can be loaded in memory on the fly.
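
The two steps above can be sketched as below. This is a minimal illustration and not the `claude_retriever` implementation: the helper names are made up for this example, and the model name `all-MiniLM-L6-v2` is an assumption (the notebook itself uses whatever `DEFAULT_EMBEDDER` points to). The embedding function is passed in as a parameter, so the file handling can be tried without downloading a model.

```python
import json
from typing import Callable, Dict, List, Sequence

Embedder = Callable[[List[str]], Sequence[Sequence[float]]]


def embed_with_sentence_transformers(
    texts: List[str], model_name: str = "all-MiniLM-L6-v2"  # assumed model name
) -> List[List[float]]:
    """Embed texts on CPU or GPU with sentence-transformers."""
    # Imported lazily so the rest of the sketch works without the library.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(texts).tolist()


def save_pairs(texts: List[str], embed: Embedder, path: str) -> None:
    """Write one {"text", "embedding"} JSON object per line."""
    embeddings = embed(texts)
    with open(path, "w", encoding="utf-8") as f:
        for text, embedding in zip(texts, embeddings):
            row = {"text": text, "embedding": [float(x) for x in embedding]}
            f.write(json.dumps(row) + "\n")


def load_pairs(path: str) -> List[Dict]:
    """Read the text/embedding pairs back into memory."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

A call such as `save_pairs(texts, embed_with_sentence_transformers, "pairs.jsonl")` would then produce the on-disk file, where `pairs.jsonl` is a placeholder path.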

In [13]:
import pandas as pd

df = pd.read_json("./data/amazon-products.jsonl", lines=True)
df

Unnamed: 0,text
0,Product Name: DB Longboards CoreFlex Crossbow ...
1,Product Name: Electronic Snap Circuits Mini Ki...
2,Product Name: 3Doodler Create Flexy 3D Printin...
3,Product Name: Guillow Airplane Design Studio w...
4,Product Name: Woodstock- Collage 500 pc Puzzle...
...,...
9997,Product Name: Remedia Publications REM536B Mon...
9998,Product Name: Trends International NFL La Char...
9999,Product Name: NewPath Learning 10 Piece Scienc...
10000,Product Name: Disney Princess Do It Yourself B...


## Remote retrieval

Local methods such as the local retriever work well for small datasets, but for larger datasets you may want a cloud-based service to host the vector datastore. In this example, we create a [MongoDB Atlas](https://www.mongodb.com/products/platform/atlas-vector-search) vector datastore; the embeddings themselves are still computed locally with `LocalEmbedder`.

In [2]:
conn_str = os.environ.get('MONGO_CONNECTION_STR')
db_name = os.environ.get('MONG_DB_NAME')
col_name = os.environ.get('MONG_COLLECTION_NAME')
vectorstore = MongoDBAtlasVectorStore(
    conn_str=conn_str,
    db_name=db_name,
    col_name=col_name,
    embedding=LocalEmbedder(DEFAULT_EMBEDDER),
)
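
For orientation, a similarity query against Atlas is typically expressed as an aggregation pipeline with a `$vectorSearch` stage. The sketch below only builds that pipeline and may differ from what `MongoDBAtlasVectorStore` does internally. The index name `vector_index` and the field names `embedding` and `text` are assumptions, as is the heuristic for `numCandidates`.

```python
from typing import Any, Dict, List, Optional, Sequence


def build_vector_search_pipeline(
    query_vector: Sequence[float],
    k: int,
    index_name: str = "vector_index",  # assumed Atlas index name
    path: str = "embedding",           # assumed field holding the vectors
    num_candidates: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Build an aggregation pipeline that returns the k nearest documents."""
    if num_candidates is None:
        # Heuristic: consider more candidates than results, capped at 10000.
        num_candidates = min(10000, max(k * 10, 100))
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                "queryVector": list(query_vector),
                "numCandidates": num_candidates,
                "limit": k,
            }
        },
        # Keep only the text and the similarity score of each match.
        {"$project": {"_id": 0, "text": 1, "score": {"$meta": "vectorSearchScore"}}},
    ]
```

With `pymongo`, such a pipeline would be run as `collection.aggregate(pipeline)` on a collection that has a matching vector search index defined in Atlas.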

Load and index the data into MongoDB. Note that this first deletes every document already in the collection.

In [7]:
docs = df.to_dict(orient='records')
vectorstore._collection.delete_many({})
vectorstore._load_index_embeddings(docs, 128)

79it [09:55,  7.54s/it]
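
Assuming the second argument to `_load_index_embeddings` is a batch size, the documents are embedded and inserted in chunks of 128. That would be consistent with the 79 iterations in the progress bar, since roughly 10,000 documents split into batches of 128 give 79 batches. A minimal sketch of that batching, using a hypothetical helper name:

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])
```

Smaller batches reduce peak memory while embedding, at the cost of more round trips to the database.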


## Create a search tool
Using the vector store we just populated, let's create an EmbeddingSearchTool.

In [3]:
from claude_retriever.searcher.searchtools.embeddings import EmbeddingSearchTool

AMAZON_SEARCH_TOOL_DESCRIPTION = 'The search engine will search over the Amazon Product database, and return for each product its title, description, and a set of tags.'

amazon_search_tool = EmbeddingSearchTool(tool_description=AMAZON_SEARCH_TOOL_DESCRIPTION,
                                         vector_store = vectorstore)
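
Conceptually, an embedding search embeds the query and returns the documents whose embeddings are closest to it. The sketch below shows that idea with an in-memory list and cosine similarity. It is an illustration only: the function names are hypothetical, and the real tool delegates the nearest-neighbour search to the vector store.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_embedding: Sequence[float], docs: List[Dict], k: int) -> List[str]:
    """Return the texts of the k documents most similar to the query."""
    ranked = sorted(
        docs,
        key=lambda d: cosine_similarity(query_embedding, d["embedding"]),
        reverse=True,
    )
    return [d["text"] for d in ranked[:k]]
```

This brute-force scan is linear in the number of documents, which is one reason to move to an indexed vector store as the dataset grows.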

Let's test to see if the search tool works!

In [4]:
dinos = amazon_search_tool.search("fun kids dinosaur book", n_search_results_to_use=1)
print(dinos)


<search_results>
<item index="1">
<page_content>
Product Name: LeapFrog Dino's Delightful Day Alphabet Book, Green

About Product: Letters and words are woven into the story in alphabetical order with phonetic sounds to introduce ABCs to your little one through a charming tale | Flip through the 16 interactive pages to hear the story read aloud, or enjoy musical play by jamming to a melody with fun sounds and musical notes | Press the light-up button to hear letter names, letter sounds and words from the story | Number buttons along Dino's back introduce counting and recognizing numbers from one to ten | This complete story with beginning, middle and end exposes your child to early reading skills. 2AA batteries are included for demo purposes, replace new batteries for regular use. Product dimensions: 12.3" Wide x 12.5" Height x 2.7" Depth

Categories: Toys & Games | Learning & Education | Science Kits & Toys
</page_content>
</item>
</search_results>
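
The tool returns its results wrapped in XML-style tags, which makes each item easy for Claude to tell apart. A formatter producing the same layout as the output above might look like the following. This is a sketch inferred from that output, not the library's code, and the function name is hypothetical.

```python
from typing import List


def format_search_results(contents: List[str]) -> str:
    """Wrap each result in <item>/<page_content> tags, numbered from 1."""
    items = "\n".join(
        f'<item index="{i}">\n<page_content>\n{content}\n</page_content>\n</item>'
        for i, content in enumerate(contents, start=1)
    )
    return f"<search_results>\n{items}\n</search_results>"
```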


## Use Claude with Retrieval
We can now pass this search tool to Claude to use for retrieval.

First, as a baseline, here is the basic response to the query (with no access to the tool).

In [9]:
ANTHROPIC_SEARCH_MODEL = "claude-2"

client = claude_retriever.ClientWithRetrieval(api_key=os.environ['ANTHROPIC_API_KEY'], search_tool = amazon_search_tool)

query = "I want to get my daughter more interested in science. What kind of gifts should I get her?"
prompt = f'{anthropic.HUMAN_PROMPT} {query}{anthropic.AI_PROMPT}'

basic_response = client.completions.create(
    prompt=prompt,
    stop_sequences=[anthropic.HUMAN_PROMPT],
    model=ANTHROPIC_SEARCH_MODEL,
    max_tokens_to_sample=1000,
)
print('-'*50)
print('Basic response:')
print(prompt + basic_response.completion)
print('-'*50)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
--------------------------------------------------
Basic response:


Human: I want to get my daughter more interested in science. What kind of gifts should I get her?

Assistant: Here are some science-themed gift ideas to help spark your daughter's interest:

- A kids telescope or microscope set - These allow hands-on exploration and discovery. Look for ones designed specifically for children.

- Science kits - Kits for chemistry, biology, physics, robotics, etc. allow her to do fun experiments and activities. Choose age-appropriate ones. 

- Science books - Books about space, animals, nature, engineering, etc. can feed her curiosity. Look for engaging titles with lots of visuals.

- Science toys - There are

Now we request a completion for the same query, but give Claude the ability to use the search tool while composing its response.

In [10]:
augmented_response = client.completion_with_retrieval(
    query=query,
    model=ANTHROPIC_SEARCH_MODEL,
    n_search_results_to_use=3,
    max_tokens_to_sample=1000)

print('-'*50)
print('Augmented response:')
print(prompt + augmented_response)
print('-'*50)

--------------------------------------------------
Augmented response:


Human: I want to get my daughter more interested in science. What kind of gifts should I get her?

Assistant: Based on the search results, here are some science kit recommendations to get your daughter more interested in science:

- The Scientific Explorer My First Science Kids Science Experiment Kit looks like a great starter kit for a young child. It has different experiments to spark creativity and curiosity, and teaches STEM principles through open-ended play. 

- The Hey! Play! Kids Science Kit is another good beginner kit that focuses on mixing substances and making things like litmus paper. It's designed for hands-on learning and uses common household items for the experiments.

- For an older child, the Scientific Explorer Mind Blowing Science Kit bundles magic and science experiments to teach about chemical reactions. The kits include various chemicals to create fun reactions.

- The Educational Insights 

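Often, you'll want finer-grained control over exactly how Claude uses the results. For this workflow we recommend "retrieve then complete": first ask Claude to retrieve the relevant search results, then insert them into a prompt of your own.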

In [11]:
relevant_search_results = client.retrieve(
    query=query,
    stop_sequences=[anthropic.HUMAN_PROMPT, 'END_OF_SEARCH'],
    model=ANTHROPIC_SEARCH_MODEL,
    n_search_results_to_use=3,
    max_searches_to_try=5,
    max_tokens_to_sample=1000)

print('-'*50)
print('Relevant results:')
print(relevant_search_results)
print('-'*50)

--------------------------------------------------
Relevant results:
[SearchResult(content='Product Name: Hey! Play! Kids Science Kit-Lab Set to Create Solutions, Litmus Paper, & More-Great Fun & Educational Stem Learning Activity for Boys & Girls\n\nAbout Product: Hands on learning- equipped with 4 test tubes and a holding rack, 2 beakers, dropper, measuring spoon, funnel, 3 grams of purple sweet potato powder and 10 sheets of paper filter, This is an excellent Basic starter science kit for kids! | Uses household items- The items needed for experiments that are not included with the kit are everyday items, that are easily found around the house, like scissors, plastic wrap, vinegar, baking soda, and water. | Stem activity- The science kit is a fantastic STEM (science, technology, engineering, Math) learning toy that will help your kids understand the concepts of mixing substances like acid and alkaline liquids and making things like litmus paper. | Hours of fun- this set is a wonderfu

Here we create a new prompt for answering the user's query using the retrieved search results.

In [12]:
qa_prompt = f'''{anthropic.HUMAN_PROMPT} You are a friendly product recommender. Here is a query issued by a user looking for product recommendations:

{query}

Here are a set of search results that might be helpful for answering the user's query:

{relevant_search_results}

Once again, here is the user's query:

<query>{query}</query>

Please write a response to the user that answers their query and provides them with helpful product recommendations. Feel free to use the search results above to help you write your response, or ignore them if they are not helpful.

At the end of your response, under "Products you might like:", list the top 3 product names from the search results that you think the user would most like.

Please ensure your results are in the following format:

<result>
Your response to the user's query.
</result>
<recommendations>
Products you might like:
1. Product name
2. Product name
3. Product name
</recommendations>{anthropic.AI_PROMPT}'''

response = client.completions.create(
    prompt=qa_prompt,
    stop_sequences=[anthropic.HUMAN_PROMPT],
    model=ANTHROPIC_SEARCH_MODEL,
    max_tokens_to_sample=1000,
)

print('-'*50)
print('Response:')
print(response.completion)
print('-'*50)

--------------------------------------------------
Response:
 <result>

Here are some great gift ideas to get your daughter more interested in science:

For younger kids, a science kit with hands-on experiments like the Hey! Play! Kids Science Kit can introduce basic concepts like mixing substances and chemistry in a fun way. Kits tailored for their age with safe materials are ideal. The Scientific Explorer My First Science Kit has beginner experiments as well, like growing crystals and exploring color. 

For older elementary school ages, try gifts that let them design their own experiments like the Mind Blowing Science Kit. This allows them to explore cause and effect and use science tools in an open-ended way. Something like the Engineering with Ramps Set would also let them build different ramp configurations to see concepts like force and motion at work.

Books and learning games are great too. The Sci-Ology game teaches about famous scientists in a fun way. And workbooks like the 
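
Because the prompt asks for `<result>` and `<recommendations>` tags, the two parts of the response can be pulled apart with a little parsing. The helpers below are a minimal sketch with hypothetical names. They return `None` or an empty list when a closing tag is missing, which can happen if a response is cut off before it finishes.

```python
import re
from typing import List, Optional


def extract_tag(text: str, tag: str) -> Optional[str]:
    """Return the stripped contents of the first <tag>...</tag> pair, if any."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_recommendations(text: str) -> List[str]:
    """Collect the numbered product names inside <recommendations>."""
    block = extract_tag(text, "recommendations")
    if block is None:
        return []
    # Match lines of the form "1. Product name".
    return [
        m.group(1).strip()
        for m in re.finditer(r"^\s*\d+\.\s*(.+)$", block, flags=re.MULTILINE)
    ]
```

Calling `extract_tag(response.completion, "result")` and `parse_recommendations(response.completion)` would then give the answer text and the product list separately.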