In [5]:
!pip install langchain langchain-pinecone langchain-huggingface pinecone-client sentence-transformers python-dotenv

Collecting langchain
  Using cached langchain-1.0.0-py3-none-any.whl.metadata (4.6 kB)
Collecting langchain-pinecone
  Using cached langchain_pinecone-0.2.12-py3-none-any.whl.metadata (8.6 kB)
Collecting langchain-huggingface
  Using cached langchain_huggingface-1.0.0-py3-none-any.whl.metadata (2.1 kB)
Collecting langchain-core<2.0.0,>=1.0.0 (from langchain)
  Using cached langchain_core-1.0.0-py3-none-any.whl.metadata (3.4 kB)
Collecting langgraph<1.1.0,>=1.0.0 (from langchain)
  Using cached langgraph-1.0.0-py3-none-any.whl.metadata (7.4 kB)
Collecting pydantic<3.0.0,>=2.7.4 (from langchain)
  Using cached pydantic-2.12.3-py3-none-any.whl.metadata (87 kB)
INFO: pip is looking at multiple versions of langchain-pinecone to determine which version is compatible with other requirements. This could take a while.
Collecting langchain-pinecone
  Using cached langchain_pinecone-0.2.11-py3-none-any.whl.metadata (6.1 kB)
  Using cached langchain_pinecone-0.2.10-py3-none-any.whl.metadata (5.3 k


[notice] A new release of pip is available: 25.0.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip


## 1. Setup Environment
First, let's load our API keys and other configuration details from a `.env` file. This is a best practice for keeping sensitive information out of the notebook.

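Create a file named `.env` in your project's root directory and add your keys, one `NAME=value` pair per line, using the variable names `PINECONE_API_KEY` and `GOOGLE_API_KEY` that the next cell reads.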

In [3]:
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")

# Check if keys are loaded
if not PINECONE_API_KEY or not GOOGLE_API_KEY:
    print("API keys not found. Please create a .env file and add your keys.")
else:
    print("API keys loaded successfully.")

# This is the name of our index in Pinecone
PINECONE_INDEX_NAME = "ikarus"

API keys loaded successfully.
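The check above only prints a message, so later cells would still run with a missing key. If failing fast is preferred, a stricter variant could raise instead. The sketch below is one way to do that; `require_env` is a hypothetical helper name, and the demo uses a plain dictionary with placeholder values in place of the real environment.

```python
import os


def require_env(names, env=None):
    """Return {name: value} for each name, raising if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in names}


# Demo with a plain dict standing in for os.environ (placeholder values only).
demo_env = {"PINECONE_API_KEY": "pc-placeholder", "GOOGLE_API_KEY": "g-placeholder"}
keys = require_env(["PINECONE_API_KEY", "GOOGLE_API_KEY"], env=demo_env)
print(sorted(keys))
```

In the notebook itself, calling `require_env(["PINECONE_API_KEY", "GOOGLE_API_KEY"])` after `load_dotenv()` would stop execution at this cell when a key is absent.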


## 2. Load Cleaned Data
We'll now load the `cleaned_data.csv` file that we prepared in the previous notebook.

In [4]:
import pandas as pd

# Load the dataset
df = pd.read_csv('cleaned_data.csv')

# Handle potential empty rows
df.dropna(subset=['uniq_id', 'combined_text'], inplace=True)

# Display the first few rows
print(f"Loaded {len(df)} records from cleaned_data.csv")
df.head()

Loaded 312 records from cleaned_data.csv


| | uniq_id | title | brand | price | images | categories | material | color | combined_text |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 02593e81-5c09-5069-8516-b0b29f439ded | GOYMFK 1pc Free Standing Shoe Rack, Multi-laye... | GOYMFK | 24.99 | ['https://m.media-amazon.com/images/I/416WaLx1... | ['Home & Kitchen', 'Storage & Organization', '... | Metal | White | GOYMFK 1pc Free Standing Shoe Rack, Multi-laye... |
| 1 | 5938d217-b8c5-5d3e-b1cf-e28e340f292e | subrtex Leather ding Room, Dining Chairs Set o... | subrtex | 53.99 | ['https://m.media-amazon.com/images/I/31SejUEW... | ['Home & Kitchen', 'Furniture', 'Dining Room F... | Sponge | Black | subrtex Leather ding Room, Dining Chairs Set o... |
| 2 | b2ede786-3f51-5a45-9a5b-bcf856958cd8 | Plant Repotting Mat MUYETOL Waterproof Transpl... | MUYETOL | 5.98 | ['https://m.media-amazon.com/images/I/41RgefVq... | ['Patio, Lawn & Garden', 'Outdoor Décor', 'Doo... | Polyethylene | Green | Plant Repotting Mat MUYETOL Waterproof Transpl... |
| 3 | 8fd9377b-cfa6-5f10-835c-6b8eca2816b5 | Pickleball Doormat, Welcome Doormat Absorbent ... | VEWETOL | 13.99 | ['https://m.media-amazon.com/images/I/61vz1Igl... | ['Patio, Lawn & Garden', 'Outdoor Décor', 'Doo... | Rubber | A5589 | Pickleball Doormat, Welcome Doormat Absorbent ... |
| 4 | bdc9aa30-9439-50dc-8e89-213ea211d66a | JOIN IRON Foldable TV Trays for Eating Set of ... | JOIN IRON Store | 89.99 | ['https://m.media-amazon.com/images/I/41p4d4VJ... | ['Home & Kitchen', 'Furniture', 'Game & Recrea... | Iron | Grey Set of 4 | JOIN IRON Foldable TV Trays for Eating Set of ... |


## 3. Initialize the Embedding Model with LangChain
We will use a sentence-transformer model from HuggingFace to convert our `combined_text` field into dense vector embeddings. `all-MiniLM-L6-v2` is a compact model that produces 384-dimensional vectors, which keeps embedding fast on a CPU while remaining effective for semantic search. LangChain's `HuggingFaceEmbeddings` wrapper loads it in a few lines.

In [5]:
from langchain_huggingface import HuggingFaceEmbeddings

# Initialize the embedding model
model_name = "sentence-transformers/all-MiniLM-L6-v2"
model_kwargs = {"device": "cpu"} # Use "cuda" if you have a GPU
encode_kwargs = {"normalize_embeddings": False}

embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

print("Embedding model initialized.")

Embedding model initialized.
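To make concrete how embeddings support semantic search, here is a minimal sketch of cosine similarity, the metric this notebook uses for its Pinecone index. The three-dimensional vectors are toy values chosen for illustration, not real model output, and `cosine_similarity` is a helper written for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (assumed values, not produced by all-MiniLM-L6-v2).
query = [1.0, 0.0, 1.0]
chair = [2.0, 0.0, 2.0]    # same direction as the query, twice the length
doormat = [0.0, 3.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query, chair))    # close to 1: same direction
print(cosine_similarity(query, doormat))  # 0: no shared direction
```

Because cosine similarity ignores vector length, leaving `normalize_embeddings` set to `False` should not change the ranking when the index metric is cosine.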


## 4. Set Up Pinecone Vector Store
Now, we'll connect to Pinecone. We will check if our desired index already exists. If not, we will create it.

**Important:** The `dimension` of the index *must* match the output dimension of our embedding model. For `all-MiniLM-L6-v2`, this is **384**.
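Rather than hard-coding the dimension, it could be measured from the embedding object by embedding a short probe string. The sketch below assumes only that the object exposes `embed_query`, as LangChain embedding classes do; `embedding_dimension` and `StubEmbedder` are hypothetical names, and the stub stands in for the real model so the example runs without downloading anything.

```python
def embedding_dimension(embedder, probe="dimension probe"):
    """Embed a short probe string and return the length of the resulting vector."""
    return len(embedder.embed_query(probe))


class StubEmbedder:
    """Hypothetical stand-in for an embedding model with a fixed output size."""

    def __init__(self, dim):
        self.dim = dim

    def embed_query(self, text):
        # Return a zero vector of the configured size; a real model returns floats.
        return [0.0] * self.dim


dim = embedding_dimension(StubEmbedder(384))
print(dim)  # 384
```

In this notebook, `embedding_dimension(embeddings)` could then be passed as `dimension` when creating the index, so the two values cannot drift apart if the model is swapped.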

In [6]:
from pinecone import Pinecone, ServerlessSpec
from langchain_pinecone import PineconeVectorStore

# Initialize Pinecone client
pc = Pinecone(api_key=PINECONE_API_KEY)

# Check if the index already exists
if PINECONE_INDEX_NAME not in pc.list_indexes().names():
    print(f"Creating index '{PINECONE_INDEX_NAME}'...")
    pc.create_index(
        name=PINECONE_INDEX_NAME,
        dimension=384,  # This MUST match your embedding model
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )
    print("Index created successfully.")
else:
    print(f"Index '{PINECONE_INDEX_NAME}' already exists.")

# Get the index object
index = pc.Index(PINECONE_INDEX_NAME)

# Pass the index to LangChain
vectorstore = PineconeVectorStore(index=index, embedding=embeddings)
print("Pinecone vector store initialized successfully.")


For example, replace imports like: `from langchain_core.pydantic_v1 import BaseModel`
with: `from pydantic import BaseModel`
or the v1 compatibility namespace if you are working in a code base that has not been fully upgraded to pydantic 2 yet: `from pydantic.v1 import BaseModel`

  from langchain_pinecone.vectorstores import Pinecone, PineconeVectorStore


Index 'ikarus' already exists.
Pinecone vector store initialized successfully.


## 5. Generate and Upsert Embeddings
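This is the final step. We will iterate through our DataFrame in batches, generate embeddings for the `combined_text` of each product, and "upsert" them into our Pinecone index. An upsert updates a vector if its ID already exists and inserts it otherwise, so re-running this cell with the same `uniq_id` values overwrites records instead of duplicating them.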

The `metadata` for each vector will include the product's title, brand, price, and image URLs, so we can retrieve this information during our search without needing another database lookup.

In [7]:
from tqdm.auto import tqdm  # For progress bar

# We'll process the data in batches to be efficient
batch_size = 100

for i in tqdm(range(0, len(df), batch_size), desc="Upserting to Pinecone"):
    # Get the batch of data
    i_end = min(i + batch_size, len(df))
    batch = df.iloc[i:i_end]

    # Extract fields
    ids = batch["uniq_id"].astype(str).tolist()
    texts = batch["combined_text"].astype(str).tolist()

    # Prepare metadata — ensure all fields are JSON serializable
    metadata = [
        {
            "title": str(row.get("title", "")),
            "brand": str(row.get("brand", "")),
            "price": float(row.get("price", 0.0)) if pd.notnull(row.get("price")) else 0.0,
            "images": str(row.get("images", "[]")),
        }
        for _, row in batch.iterrows()
    ]

    # ✅ Add documents to Pinecone via LangChain vectorstore
    # (LangChain handles embedding and upsert automatically)
    vectorstore.add_texts(texts=texts, ids=ids, metadatas=metadata)

print("\n--- Embedding and Upserting Complete ---")
print(f"All {len(df)} product records have been processed and stored in the '{PINECONE_INDEX_NAME}' index.")


Upserting to Pinecone:   0%|          | 0/4 [00:00<?, ?it/s]


--- Embedding and Upserting Complete ---
All 312 product records have been processed and stored in the 'ikarus' index.
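One detail of the metadata built above: if a column such as `brand` contains a missing value, `str()` turns it into the literal string `"nan"`. The sketch below shows one possible way to normalise fields before upserting. `clean_text` and `clean_price` are hypothetical helpers, and the sample row is invented for illustration.

```python
import math


def clean_text(value, default=""):
    """Return str(value), or default when value is None or a float NaN."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return default
    return str(value)


def clean_price(value, default=0.0):
    """Convert value to float, falling back to default for missing or bad input."""
    try:
        price = float(value)
    except (TypeError, ValueError):
        return default
    return default if math.isnan(price) else price


# Invented sample row with a missing brand and missing images.
row = {"title": "Shoe Rack", "brand": float("nan"), "price": "24.99", "images": None}
metadata = {
    "title": clean_text(row.get("title")),
    "brand": clean_text(row.get("brand")),
    "price": clean_price(row.get("price")),
    "images": clean_text(row.get("images"), default="[]"),
}
print(metadata)
```

The `isinstance(value, float)` check also covers the NaN values pandas uses for missing entries, since they are floats.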


### Verification (Optional)
You can run a quick similarity search to verify that the data has been indexed correctly.

In [8]:
# Run a quick test query
query = "A comfortable chair for a living room"

try:
    results = vectorstore.similarity_search(query, k=3)

    print(f"Results for query: '{query}'\n")
    for doc in results:
        print(f"Title: {doc.metadata.get('title')}")
        print(f"Brand: {doc.metadata.get('brand')}")
        print(f"Price: ${doc.metadata.get('price')}")
        print("-" * 30)

except Exception as e:
    print(f"An error occurred during the test query: {e}")
    print("This might happen if the index is still initializing. Please wait a few minutes and try again.")

Results for query: 'A comfortable chair for a living room'

Title: Karl home Accent Chair Mid-Century Modern Chair with Pillow Upholstered Lounge Arm Chair with Solid Wood Frame & Soft Cushion for Living Room, Bedroom, Belcony, Beige
Brand: Karl home Store
Price: $149.99
------------------------------
Title: Ergonomic Office Chair,Office Chair, with Lumbar Support & 3D Headrest & Flip Up Arms Home Office Desk Chairs Rockable High Back Swivel Computer Chair White Frame Mesh Study Chair（All Black）
Brand: SCaua
Price: $126.99
------------------------------
Title: Lazy Chair with Ottoman, Modern Lounge Accent Chair with Footrest, Pillow and Blanket, Leisure Sofa Chair Reading Chair with Armrests and Side Pocket for Living Room, Bedroom & Small Space, Grey
Brand: WARMGIFT WM
Price: $139.99
------------------------------
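The `images` field was stored as a string, so a search result returns it as text rather than a list. Assuming that string is the Python-style list representation seen in the DataFrame preview, it could be converted back as sketched below. `parse_image_urls` is a hypothetical helper and the URLs are placeholders.

```python
import ast


def parse_image_urls(raw):
    """Parse a stringified Python list of URLs; return [] if it cannot be parsed."""
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return []
    if not isinstance(value, list):
        return []
    # Keep only string items and trim stray whitespace around each URL.
    return [item.strip() for item in value if isinstance(item, str)]


# Placeholder metadata value in the same shape as the stored field.
raw = "['https://example.com/a.jpg', ' https://example.com/b.jpg']"
urls = parse_image_urls(raw)
print(urls)
```

`ast.literal_eval` only evaluates literals, which makes it a safer choice than `eval` for text that comes back from an external store.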
