# Retrieval-Augmented Generation using Pinecone

This notebook demonstrates how to connect Claude with the data in your Pinecone vector database through a technique called retrieval-augmented generation (RAG). We will cover the following steps:

1. Embedding and upserting data into Pinecone using integrated inference
2. Querying an index from Pinecone to retrieve results
3. Using Claude to answer questions with information from the database

## Setup
First, let's install the necessary libraries and set the API keys this notebook uses. We need a [Claude API key](https://docs.anthropic.com/en/api/getting-started) and a free [Pinecone API key](https://docs.pinecone.io/guides/get-started/quickstart).

In [107]:
%pip install anthropic pinecone

Note: you may need to restart the kernel to use updated packages.


In [108]:
# Insert your API keys here
import os
os.environ["ANTHROPIC_API_KEY"] = "INSERT_YOUR_API_KEY"

os.environ["PINECONE_API_KEY"] = "INSERT_YOUR_API_KEY"
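Before going further, it can help to fail fast if a key is still set to the placeholder. The following is a minimal sketch, assuming the placeholder string used in the cell above; `missing_keys` is a hypothetical helper and not part of either SDK.

```python
import os

PLACEHOLDER = "INSERT_YOUR_API_KEY"  # the placeholder string used in the cell above


def missing_keys(names, env=None):
    """Return the names whose values are unset, empty, or still the placeholder."""
    env = os.environ if env is None else env
    return [name for name in names if env.get(name, "") in ("", PLACEHOLDER)]


# Example with an explicit mapping, so no real key is read or printed
example_env = {"ANTHROPIC_API_KEY": "sk-example", "PINECONE_API_KEY": PLACEHOLDER}
print(missing_keys(["ANTHROPIC_API_KEY", "PINECONE_API_KEY"], example_env))
```

Calling `missing_keys(["ANTHROPIC_API_KEY", "PINECONE_API_KEY"])` with no second argument checks the real environment instead.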

## Download the dataset
Now let's download the Amazon products dataset which has over 10k Amazon product descriptions and load it into a DataFrame.

In [109]:

!curl -0 "https://www-cdn.anthropic.com/48affa556a5af1de657d426bcc1506cdf7e2f68e/amazon-products.jsonl" > amazon-products.jsonl


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 5559k  100 5559k    0     0  26.7M      0 --:--:-- --:--:-- --:--:-- 26.7M


In [110]:
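import json

import pandas as pd

data = []
with open('amazon-products.jsonl', 'r') as file:
    for line in file:
        try:
            # Each line is one JSON object; json.loads parses it without executing file contents
            data.append(json.loads(line))
        except json.JSONDecodeError:
            # Skip blank or malformed lines
            pass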

df = pd.DataFrame(data)
display(df.head())
len(df)

Unnamed: 0,text
0,Product Name: DB Longboards CoreFlex Crossbow ...
1,Product Name: Electronic Snap Circuits Mini Ki...
2,Product Name: 3Doodler Create Flexy 3D Printin...
3,Product Name: Guillow Airplane Design Studio w...
4,Product Name: Woodstock- Collage 500 pc Puzzle...


10002

## Pinecone Vector Database and Integrated Inference

To create our index, we first need a free API key from Pinecone. We'll create an index with integrated inference, which
means we specify a model hosted on Pinecone to use for embedding queries and documents. Pinecone handles the embedding for us,
so we can pass it text directly. To learn more about integrated inference, [look here](https://docs.pinecone.io/guides/inference/understanding-inference#integrated-inference).

Once we have the key, we can initialize the index as follows:

In [111]:
from pinecone import Pinecone

pc = Pinecone()

dense_index_name = "rag-with-anthropic"
if not pc.has_index(dense_index_name):
    pc.create_index_for_model(
        name=dense_index_name,
        cloud="aws",
        region="us-east-1",
        # Chunk text will be the field we embed from our documents
        embed={
            "model":"llama-text-embed-v2",
            "field_map":{"text": "chunk_text"}
        }
    )

In [112]:

dense_index = pc.Index(dense_index_name)

dense_index.describe_index_stats()

{'dimension': 1024,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'amz-products': {'vector_count': 10002}},
 'total_vector_count': 10002,
 'vector_type': 'dense'}

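On a freshly created index, `total_vector_count` will be 0, as we haven't added any vectors yet. The output shown above reports 10002, which suggests it was captured from a run where the index had already been populated.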

## Embedding and Upserting data to Pinecone 

With our index set up, we can now take our product descriptions, embed them, and upsert.

In [113]:
from tqdm import tqdm

descriptions = df["text"].tolist()
batch_size = 96  # how many embeddings we create and insert at once

def upsert_and_embed_into_index(index, namespace, descriptions, batch_size=96):
    for i in tqdm(range(0, len(descriptions), batch_size)):
        # Iterate over descriptions in batches of 96, the max number of docs that can be embedded each call
        # find end of batch
        i_end = min(len(descriptions), i+batch_size)
        descriptions_batch = descriptions[i:i_end]

        records = [
            {
                "id": f"description_{i+idx}",
                "chunk_text": description,
            }
            for idx, description in enumerate(descriptions_batch)
        ]

        # embed and upsert into Pinecone. This operation does both!
        index.upsert_records(namespace=namespace, records=records)


# We specify a namespace to upsert our data into.
# We only upsert if the index holds fewer vectors than we have descriptions (this covers the empty case too)
embed_condition = dense_index.describe_index_stats()["total_vector_count"] < len(descriptions)
if embed_condition:
    upsert_and_embed_into_index(dense_index, "amz-products", descriptions, batch_size=batch_size)
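To see how the batching loop above splits the work, here is a small sketch of the same index arithmetic with no Pinecone calls. `batch_ranges` is a hypothetical helper introduced only for illustration; the item count and batch size are the ones used in this notebook.

```python
def batch_ranges(n_items, batch_size=96):
    """Yield (start, end) index pairs covering n_items in chunks of batch_size."""
    for start in range(0, n_items, batch_size):
        yield start, min(n_items, start + batch_size)


ranges = list(batch_ranges(10002, 96))
print(len(ranges))            # number of upsert calls the loop would make
print(ranges[0], ranges[-1])  # first and last (start, end) slices
```

The last batch is shorter than the others because 10002 is not a multiple of 96, which is why the loop clamps the end index with `min`.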


In [114]:
dense_index.describe_index_stats()

{'dimension': 1024,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'amz-products': {'vector_count': 10002}},
 'total_vector_count': 10002,
 'vector_type': 'dense'}

## Making queries

With our index populated, we can start making queries to get results. We can take a natural language question, embed it, and query it against the index to return semantically similar product descriptions.
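To make "semantically similar" concrete: the dense index above reports a `cosine` metric, so results are ranked by the cosine similarity between the query embedding and each stored embedding. The sketch below illustrates that ranking with made-up 3-dimensional vectors; real embeddings in this index have 1024 dimensions and come from the hosted model, and Pinecone performs this search for us.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, docs, k=2):
    """Return the k (doc_id, score) pairs with the highest cosine similarity."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors, invented for illustration only
toy_docs = {
    "science_kit": [0.9, 0.1, 0.0],
    "board_game": [0.1, 0.9, 0.1],
    "microscope": [0.8, 0.0, 0.3],
}
toy_query = [1.0, 0.0, 0.1]

for doc_id, score in top_k(toy_query, toy_docs):
    print(doc_id, round(score, 3))
```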

In [115]:
USER_QUESTION = "I want to get my daughter more interested in science. What kind of gifts should I get her?"


# search embeds the query and queries the Pinecone index in one step
results = dense_index.search(
    namespace="amz-products", 
    query={
        # specifies number of results to return
        "top_k":5,
        # specifies the query to embed and search for
        "inputs":{
            "text": USER_QUESTION
        }
    }
)

results = results["result"]["hits"]

for num, result in enumerate(results):
    # Return result and score
    print(f"Result {num+1}:")
    print(result["fields"]["chunk_text"], result["_score"])
    print("\n")



Result 1:
Product Name: Hey! Play! Kids Science Kit-Lab Set to Create Solutions, Litmus Paper, & More-Great Fun & Educational Stem Learning Activity for Boys & Girls

About Product: Hands on learning- equipped with 4 test tubes and a holding rack, 2 beakers, dropper, measuring spoon, funnel, 3 grams of purple sweet potato powder and 10 sheets of paper filter, This is an excellent Basic starter science kit for kids! | Uses household items- The items needed for experiments that are not included with the kit are everyday items, that are easily found around the house, like scissors, plastic wrap, vinegar, baking soda, and water. | Stem activity- The science kit is a fantastic STEM (science, technology, engineering, Math) learning toy that will help your kids understand the concepts of mixing substances like acid and alkaline liquids and making things like litmus paper. | Hours of fun- this set is a wonderful gift for birthdays, holidays, or any occasion! Your little girl or boy will have h

## Implementing keyword search using Claude and pinecone-sparse

We can combine sparse search, an advanced variant of keyword search, with query enrichment from Claude.

Using Claude, we can take the user's question and generate search keywords from it. Searching with several diverse keywords casts a wider net over the index than a single query, which can surface more relevant product descriptions.

We'll make another index, this time with [pinecone-sparse](https://www.pinecone.io/learn/learn-pinecone-sparse/), a sparse embedding model that is optimized for keyword search.
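As rough intuition for how sparse retrieval scores a match, a sparse vector can be pictured as a mapping from tokens to weights, and a dot product only accumulates over the tokens two vectors share. The sketch below is a toy that uses raw word counts as weights; it is not how pinecone-sparse computes its embeddings, which assigns its own token weights.

```python
from collections import Counter


def sparse_vector(text):
    """Toy sparse embedding: lowercase whitespace tokens mapped to raw counts."""
    return Counter(text.lower().split())


def sparse_dot(a, b):
    """Dot product over the tokens the two sparse vectors share."""
    return sum(weight * b[token] for token, weight in a.items() if token in b)


query = sparse_vector("microscope for children")
toy_docs = [
    "microscope set for children with slides",
    "wooden puzzle for toddlers",
    "marble run construction set",
]

for doc in toy_docs:
    print(sparse_dot(query, sparse_vector(doc)), doc)
```

Note that the second toy document scores above zero only because of the common word "for", which hints at why weighting tokens by more than raw counts matters in practice.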

In [116]:
# Create our sparse index

sparse_index_name = "rag-with-anthropic-sparse"
if not pc.has_index(sparse_index_name):
    pc.create_index_for_model(
        name=sparse_index_name,
        cloud="aws",
        region="us-east-1",
        embed={
            "model":"pinecone-sparse-english-v0",
            "field_map":{"text": "chunk_text"}
        }
    )

sparse_index = pc.Index(sparse_index_name)

# Same check as before: upsert only if the index holds fewer vectors than we have descriptions
sparse_embed_condition = sparse_index.describe_index_stats()["total_vector_count"] < len(descriptions)

# This time we embed into the sparse index. Integrated inference handles the embedding for us.
if sparse_embed_condition:
    upsert_and_embed_into_index(sparse_index, "amz-products", descriptions)

sparse_index.describe_index_stats()

{'index_fullness': 0.0,
 'metric': 'dotproduct',
 'namespaces': {'amz-products': {'vector_count': 10002}},
 'total_vector_count': 10002,
 'vector_type': 'sparse'}

In [117]:
import anthropic

client = anthropic.Anthropic()
model = "claude-3-5-haiku-latest"

def get_completion(prompt):
    message = client.messages.create(
        model=model,
        max_tokens=1000,
        temperature=1,
        system="You are a keyword generating assistant. Given a user message, you'll generate keywords to search for products.",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"{prompt}"
                    }
                ]
            }
        ]
    )
    return message.content

In [118]:
def create_keyword_prompt(question):
    return f""" 


    Given a question, generate a list of 5 very diverse search keywords that can be used to search for products on Amazon.

    The question is: {question}

    Output your keywords as a JSON that has one property "keywords" that is a list of strings. Only output valid JSON."""


With our Anthropic client set up and our prompt created, we can now generate keywords from the question. We ask for the keywords as a JSON object so we can easily parse them from Claude's output.

In [119]:
keyword_json = get_completion(create_keyword_prompt(USER_QUESTION))

# grab the text, it will look like JSON in a text string
print(keyword_json[0].text)


{
    "keywords": [
        "stem toys for girls",
        "science experiment kit for kids",
        "microscope for children",
        "coding toy for young girls",
        "astronomy starter set"
    ]
}


In [120]:
import json

# Extract the keywords from the JSON
data = json.loads(keyword_json[0].text)
keywords_list = data['keywords']
print(keywords_list)

['stem toys for girls', 'science experiment kit for kids', 'microscope for children', 'coding toy for young girls', 'astronomy starter set']
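Calling `json.loads` directly works when the model returns only JSON, as it did here. If the reply ever includes extra text around the object, a more defensive parse may help. The following is a minimal sketch; `extract_keywords` is a hypothetical helper, and it assumes the reply contains a single JSON object.

```python
import json


def extract_keywords(raw_text):
    """Parse the outermost {...} span in raw_text and return its "keywords" list."""
    start = raw_text.find("{")
    end = raw_text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    payload = json.loads(raw_text[start:end + 1])
    keywords = payload.get("keywords")
    if not isinstance(keywords, list) or not all(isinstance(k, str) for k in keywords):
        raise ValueError('expected a "keywords" list of strings')
    return keywords


# Invented example of a reply with text around the JSON object
wrapped = 'Sure! Here are the keywords: {"keywords": ["stem toys", "microscope"]} Hope that helps.'
print(extract_keywords(wrapped))
```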


Now with our keywords in a list, let's embed each one, query it against the index, and return the top 5 most relevant product descriptions.

In [121]:
results_list = []
for keyword in keywords_list:
    
    search_results = sparse_index.search_records(
        namespace="amz-products", 
        query={
                # specifies number of results to return
                "top_k":5,
                # specifies the query to embed and search for
                "inputs":{
                    "text": keyword
                }
        }
    )

    # append the search results to the list
    for search_result in search_results["result"]["hits"]:
            results_list.append(search_result['fields']['chunk_text'])


print("Some of the results")
for result in results_list[:5]:
    print(result)
    print("\n")
print(len(results_list))

Some of the results
Product Name: The Learning Journey Techno Gears Marble Mania STEM Construction Set – Catapult Marble Run (80+ pieces) – Learning Toys & Gifts for Boys & Girls Ages 6 Years and Up

About Product: KEY FEATURES – This STEM construction set features 80+ brightly colored interlocking pieces, base plates, 3-D connectors, a pendulum marble drop, a marble launcher loop, and a step-by-step instruction manual to guide you to the finished product. The Catapult aligns with STEM standards. | EDUCATIONAL BENEFITS – Supports STEM education as the Catapult builds science, technology, engineering, and mathematic skills. The mechanics of the structure include gears and will introduce the science of gear ratio. | ENHANCES COGNATIVE SKILLS – Supports problem solving, fluid reasoning, mechanics, and engineering skills. Builds confidence as children design structures that move and work. Boosts self-esteem by providing a sense of achievement once the toy is fully constructed. | SAFETY - A
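Because each keyword is searched separately, the same product description can come back for more than one keyword. If that matters for your prompt size, one option is to drop exact duplicates while keeping the original order. This is a minimal sketch; `dedupe_preserving_order` is a hypothetical helper and the sample strings are invented.

```python
def dedupe_preserving_order(items):
    """Remove exact duplicates, keeping the first occurrence of each item."""
    return list(dict.fromkeys(items))


sample_results = [
    "Product Name: Microscope A",
    "Product Name: Science Kit B",
    "Product Name: Microscope A",  # returned again for a second keyword
    "Product Name: Telescope C",
]
unique_results = dedupe_preserving_order(sample_results)
print(len(sample_results), "->", len(unique_results))
```

In this notebook it could be applied as `results_list = dedupe_preserving_order(results_list)` before building the answer prompt.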

## Answering with Claude

Now that we have a list of product descriptions, let's wrap them in XML-style `<search_results>` tags, so the retrieved context is clearly separated from the question, and pass the formatted descriptions into another prompt.

In [122]:
# Formatting search results
def format_results(extracted: list[str]) -> str:
    result = "\n".join(
        [
            f'<item index="{i+1}">\n<page_content>\n{r}\n</page_content>\n</item>'
            for i, r in enumerate(extracted)
        ]
    )

    return f"\n<search_results>\n{result}\n</search_results>"

def create_answer_prompt(results_list, question):
    return f"""\n\nHuman: {format_results(results_list)} Using the search results provided within the <search_results></search_results> tags, please answer the following question <question>{question}</question>. Do not reference the search results in your answer.\n\nAssistant:"""
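Note that `get_completion` sends every prompt with the keyword-generation system prompt, including the answering step. One possible variation is to build the answering request with its own system prompt and let the `messages` list carry the roles, without the `Human:`/`Assistant:` markers in the text. The sketch below only assembles the keyword arguments and does not call the API; `build_answer_request` and the system prompt wording are assumptions for illustration, not part of the original notebook.

```python
# Hypothetical system prompt for the answering step; the wording is an assumption
ANSWER_SYSTEM_PROMPT = (
    "You are a helpful shopping assistant. Answer using only the provided search results."
)


def build_answer_request(formatted_results, question, model="claude-3-5-haiku-latest"):
    """Build keyword arguments for client.messages.create without calling the API."""
    user_text = (
        f"{formatted_results}\n\n"
        "Using the search results provided within the <search_results></search_results> tags, "
        f"please answer the following question <question>{question}</question>. "
        "Do not reference the search results in your answer."
    )
    return {
        "model": model,
        "max_tokens": 1000,
        "system": ANSWER_SYSTEM_PROMPT,
        "messages": [{"role": "user", "content": user_text}],
    }


# Placeholder search results, just to show the shape of the request
request = build_answer_request("<search_results>\n...\n</search_results>", "What should I buy?")
print(request["messages"][0]["role"])
```

With the `client` created earlier, the request could then be sent as `client.messages.create(**request)`.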


In [123]:
print(create_answer_prompt(results_list, USER_QUESTION))



Human: 
<search_results>
<item index="1">
<page_content>
Product Name: The Learning Journey Techno Gears Marble Mania STEM Construction Set – Catapult Marble Run (80+ pieces) – Learning Toys & Gifts for Boys & Girls Ages 6 Years and Up

About Product: KEY FEATURES – This STEM construction set features 80+ brightly colored interlocking pieces, base plates, 3-D connectors, a pendulum marble drop, a marble launcher loop, and a step-by-step instruction manual to guide you to the finished product. The Catapult aligns with STEM standards. | EDUCATIONAL BENEFITS – Supports STEM education as the Catapult builds science, technology, engineering, and mathematic skills. The mechanics of the structure include gears and will introduce the science of gear ratio. | ENHANCES COGNATIVE SKILLS – Supports problem solving, fluid reasoning, mechanics, and engineering skills. Builds confidence as children design structures that move and work. Boosts self-esteem by providing a sense of achievement once the

Finally, let's ask the original user's question and get our answer from Claude.

In [124]:
answer = get_completion(create_answer_prompt(results_list, USER_QUESTION))


In [125]:
print(answer[0].text)

Based on the search results, here are some excellent gift ideas to help spark your daughter's interest in science:

1. Science Experiment Kits
- These are fantastic STEM (Science, Technology, Engineering, Math) learning toys that make science fun and hands-on
- Options include:
  * Scientific Explorer My First Science Kit (for ages 4+)
  * The Young Scientists Club Science Art Fusion Rainbows Kit (for ages 5+)
  * A Glow in the Dark Science Experiment Kit
  * An Optical Illusions Science Kit

2. Microscopes
- Great for exploring nature and developing scientific curiosity
- Kid-friendly options like:
  * Nature Explorer Microscope (for ages 3+)
  * Learning Resources Primary Microscope (for ages 6+)
  * IQCREW Kids Microscope Set with Slides

3. STEM Building Toys
- Encourage problem-solving and engineering skills
- Examples include:
  * Brackitz Inventor Building Kit
  * The Learning Journey Techno Gears Sets
  * Marble Run construction sets

These gifts are designed to be educational,

## Putting it all together

In [126]:
def answer_query_with_enriched_search(question):
    # Generate keywords
    keywords = get_completion(create_keyword_prompt(question))
    keywords_list = json.loads(keywords[0].text)['keywords']
    print("Found keywords:")
    print(keywords_list)

    
    # Search for keywords
    results_list = []
    for keyword in keywords_list:
    
        search_results = sparse_index.search_records(
            namespace="amz-products", 
            query={
                    # specifies number of results to return
                    "top_k":5,
                    # specifies the query to embed and search for
                    "inputs":{
                        "text": keyword
                    }
            }
        )

        # append the search results to the list
        for search_result in search_results["result"]["hits"]:
            results_list.append(search_result['fields']['chunk_text'])

    # Format results
    formatted_results = create_answer_prompt(results_list, question)

    # Answer the question
    answer = get_completion(formatted_results)
    return answer[0].text


print(answer_query_with_enriched_search(USER_QUESTION))

Found keywords:
['science experiment kit for girls', 'coding toys for young female learners', 'microscope for children', 'STEM educational gifts', 'astronomy telescope for kids']
Based on the search results, here are some excellent science-focused gift ideas to help spark your daughter's interest in science:

1. Educational Science Kits
- The Young Scientists Club Science Art Fusion Rainbows Kit (ages 5+)
- Hey! Play! Kids Science Kit with test tubes and experiments
- ScienceWiz Energy Experiment Kit (ages 8+)

2. STEM Learning Toys
- Engino Junior Robotics Set for learning coding and engineering
- The Learning Journey Techno Gears Marble Mania sets
- ALEX Toys Future Coders Robot Races Coding Skills Kit (ages 4+)

3. Exploration Tools
- Nature Explorer Microscope (allows kids to examine things up close)
- Kids Telescope for astronomy and nature observation
- Learning Resources Primary Microscope with multiple magnification levels

4. Hands-On Learning Options
- Design & Drill See-Thro

## Cleanup Indexes (Uncomment to run)

In [127]:
# Cleanup if needed. Uncomment and run!

# pc.delete_index(name=sparse_index_name)
# pc.delete_index(name=dense_index_name)
