Saving Electronics Dataset to Pinecone for Rerank API

This notebook demonstrates how to save the Electronics Dataset into Pinecone, structured to match the rerank embedding API from the Fleak AI webinar.

Dataset:
- The dataset used in this notebook is the Electronics Dataset from Kaggle:
https://www.kaggle.com/datasets/elvinrustam/electronics-dataset/data

References:
- Fleak AI Webinar Documentation: https://docs.fleak.ai/1.0/tutorials/product-recommendation-with-rerank

Required Libraries
Make sure you have the following libraries installed:
- pandas
- pinecone-client
- sentence-transformers
- tqdm

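You can install them using pip. The code below uses the legacy pinecone.init() interface, which was removed in version 3 of the client, so pin an earlier release:
pip install pandas "pinecone-client<3" sentence-transformers tqdm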

In [None]:
# Import Libraries

import pandas as pd
import pinecone
from sentence_transformers import SentenceTransformer
from tqdm.auto import tqdm

In [None]:
# Load and Prepare Data

# Load the Electronics Dataset
# Make sure you have downloaded the 'electronics.csv' file from Kaggle and placed it in the same directory as this notebook
df = pd.read_csv("electronics.csv")

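# Build the text to embed, then select and rename the columns to store
# (the pandas equivalent of a SQL SELECT with aliases).
# Missing titles or features become empty strings so every row has a usable description.
df['description'] = df['title'].fillna('') + ': ' + df['feature'].fillna('')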
df = df[['title', 'Sub Category', 'Price', 'discount', 'rating', 'currency', 'description']]
df = df.rename(columns={'Sub Category': 'subcategory', 'Price': 'price'})

# Display the first few rows of the transformed dataset
print(df.head())

In [None]:
## Initialize SentenceTransformer Model

model = SentenceTransformer('all-MiniLM-L6-v2')

# Function to create embeddings
def create_embeddings(texts):
    return model.encode(texts).tolist()
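
The index created later in this notebook compares these embeddings with the cosine metric. As a rough illustration only (Pinecone computes the score server-side, and the helper name cosine_similarity is hypothetical, not part of any library used here), cosine similarity between two embedding lists can be sketched in plain Python:

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction score close to 1 regardless of their length, which is why the metric suits text embeddings.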

In [None]:
## Prepare Data for Pinecone

data = []
for _, row in tqdm(df.iterrows(), total=len(df)):
    item = {
        'id': str(row.name),  # Using the index as ID
        'metadata': {
            'title': row['title'],
            'subcategory': row['subcategory'],
            'price': row['price'],
            'discount': row['discount'],
            'rating': row['rating'],
            'currency': row['currency'],
            'description': row['description']
        },
        'values': create_embeddings(row['description'])
    }
    data.append(item)
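
Rows in a CSV often have missing fields, which pandas loads as NaN, and Pinecone's metadata does not accept null values. A minimal sketch of one way to handle this, assuming it is acceptable to drop the missing keys (the helper name clean_metadata is hypothetical):

```python
import math

def clean_metadata(metadata):
    """Return a copy of metadata without None or NaN values."""
    cleaned = {}
    for key, value in metadata.items():
        if value is None:
            continue  # null values are dropped rather than stored
        if isinstance(value, float) and math.isnan(value):
            continue  # NaN is how pandas represents a missing number
        cleaned[key] = value
    return cleaned
```

It could be applied to each prepared item before upserting, for example item['metadata'] = clean_metadata(item['metadata']).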

In [None]:
## Initialize Pinecone

# Replace 'YOUR_API_KEY' and 'YOUR_ENVIRONMENT' with your actual Pinecone credentials
pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")

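# Create the index if it does not exist yet, then connect to it.
# The dimension must match the embedding model: all-MiniLM-L6-v2 outputs 384-dimensional vectors.
index_name = "electronics-rerank"
if index_name not in pinecone.list_indexes():
    pinecone.create_index(index_name, dimension=384, metric="cosine")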

index = pinecone.Index(index_name)
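
Hard-coding credentials in a notebook makes them easy to leak when the notebook is shared. One possible alternative is to read them from environment variables. The variable names PINECONE_API_KEY and PINECONE_ENVIRONMENT and the helper load_pinecone_settings are assumptions made for this sketch, not names required by Pinecone:

```python
import os

def load_pinecone_settings(env=None):
    """Read the API key and environment name from environment variables."""
    env = os.environ if env is None else env
    required = ("PINECONE_API_KEY", "PINECONE_ENVIRONMENT")  # assumed variable names
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["PINECONE_API_KEY"], env["PINECONE_ENVIRONMENT"]
```

The returned pair could then be passed on, for example api_key, environment = load_pinecone_settings() followed by pinecone.init(api_key=api_key, environment=environment).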

In [None]:
## Upsert Data to Pinecone

# Upsert data in batches
batch_size = 100
for i in tqdm(range(0, len(data), batch_size)):
    batch = data[i:i+batch_size]
    index.upsert(vectors=batch)

print("Data successfully uploaded to Pinecone!")
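
The batching loop above slices the list in steps of batch_size. The same idea can be factored into a small reusable generator. This is a sketch, and the name chunked is hypothetical:

```python
def chunked(items, size):
    """Yield consecutive slices of items, each at most size long."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

With it, the upsert loop could read for batch in chunked(data, 100): index.upsert(vectors=batch). The last batch is simply shorter when the list length is not a multiple of the batch size.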

In [None]:
## Verify Data in Pinecone

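# Query with the first item's own vector; the top match should be that same item.
# Newly upserted vectors can take a few seconds to become queryable, so retry if the result is empty.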
query_response = index.query(
    vector=data[0]['values'],
    top_k=1,
    include_metadata=True
)

print("Sample query result:")
print(query_response)
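
Printing the response requires checking it by eye. As a hedged sketch, assuming the response can be indexed like a dictionary with a "matches" list whose entries carry an "id" (the helper name top_match_id is hypothetical), the check can be made explicit:

```python
def top_match_id(response):
    """Return the ID of the best match, or None if there are no matches."""
    matches = response["matches"]
    return matches[0]["id"] if matches else None
```

Because the query vector is the first item's own embedding, top_match_id(query_response) would be expected to equal data[0]['id'] once the upload has been indexed.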

To use this notebook:
- Create a new Jupyter notebook and copy this code into it.
- Download the 'electronics.csv' file from the Kaggle dataset and place it in the same directory as your notebook.
- Replace "YOUR_API_KEY" and "YOUR_ENVIRONMENT" with your actual Pinecone credentials.
- Run each cell in order, making sure to install any missing libraries.

Once every cell has run, the electronics-rerank index holds one 384-dimensional vector per product, with the title, subcategory, price, discount, rating, currency, and description stored as metadata, matching the structure used in the rerank embedding API from the Fleak AI webinar.