# Instacart Matching

For this use case we take a list of ingredients provided by a customer and search the Instacart catalogue for the best-matching products.

To accomplish this we will:
- Load a sample list of Instacart products
- Create embeddings for the product names
- Generate lists of ingredients, embed them, and compare them with the product embeddings to find the most similar items to present

## Setup

First we will import all the packages we'll need.

In [1]:
# imports
import openai  # OpenAI Python library to make API calls
import pandas as pd 
import os

# set API key
openai.api_key = os.environ.get("OPENAI_API_KEY")

# set data directory
data_dir = os.path.join(os.pardir,'data')

## Load Data

Next we'll import the embedding utilities and load our list of Instacart products.

In [3]:
from openai.embeddings_utils import (
    get_embedding,
    distances_from_embeddings,
    tsne_components_from_embeddings,
    chart_from_components,
    indices_of_nearest_neighbors_from_distances,
)

In [4]:
# read the first file found in data_dir (this assumes it is the products CSV)
products = pd.read_csv(os.path.join(data_dir,os.listdir(data_dir)[0]))
products.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13


## Create Embeddings



In [10]:
# Trimmed to the first 1000 products
# Embedding the whole dataset takes a long time, so we work with a subset here
products_trimmed = products[:1000]
products_trimmed.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13


In [None]:
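# This will take just under 10 minutes
# embed each product name with the similarity model
products_trimmed['babbage_similarity'] = products_trimmed.product_name.apply(lambda x: get_embedding(x, engine='text-similarity-babbage-001'))
# embed each product name with the document-side search model (queries use the matching query model later)
products_trimmed['babbage_search'] = products_trimmed.product_name.apply(lambda x: get_embedding(x, engine='text-search-babbage-doc-001'))
# save the embeddings so they do not have to be recomputed
products_trimmed.to_csv(os.path.join(data_dir,'products_with_embeddings.csv'))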
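
When the saved file is read back later, each embedding arrives as text such as `"[0.01, -0.007, ...]"` rather than as a list of floats, because CSV stores every cell as a string. The sketch below shows that round trip using only the standard library. The helper name `parse_embedding` and the two-value vectors are made up for illustration; real embeddings are far longer.

```python
import ast
import csv
import io

def parse_embedding(cell):
    """Turn the text form of a list, e.g. "[0.1, -0.2]", back into a list of floats."""
    return [float(v) for v in ast.literal_eval(cell)]

# Write a tiny table the same way a list-valued column ends up in a CSV file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["product_name", "babbage_search"])
writer.writerow(["Garbanzo Beans", str([0.25, -0.5])])  # toy values, not real embeddings

# Read it back: the embedding cell is a plain string until it is parsed.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw_cell = rows[0]["babbage_search"]
restored = parse_embedding(raw_cell)
print(type(raw_cell).__name__, type(restored).__name__)
```

With pandas, the same parsing can be applied to a whole column after `pd.read_csv`, for example `df.babbage_search.apply(ast.literal_eval)`.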

In [17]:
products_trimmed

Unnamed: 0,product_id,product_name,aisle_id,department_id,babbage_similarity,babbage_search,similarities
0,1,Chocolate Sandwich Cookies,61,19,"[0.01066532451659441, -0.00783467572182417, -0...","[0.0188217144459486, 0.015092567540705204, 0.0...",0.362517
1,2,All-Seasons Salt,104,13,"[-0.0024860778357833624, 0.01097782514989376, ...","[-0.005451497621834278, 0.035369835793972015, ...",0.187338
2,3,Robust Golden Unsweetened Oolong Tea,94,7,"[0.007721468340605497, 0.015294842422008514, 0...","[0.004352790769189596, 0.03498658165335655, 0....",0.219047
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,"[-0.0015017695259302855, 0.018505744636058807,...","[-0.006066039204597473, 0.03014267608523369, -...",0.212363
4,5,Green Chile Anytime Sauce,5,13,"[-0.0007468880503438413, 0.01609272137284279, ...","[0.006646996829658747, 0.038843002170324326, -...",0.209928
...,...,...,...,...,...,...,...
995,996,Honey Cinnamon Nut-Thins Crackers,125,19,"[0.006676878314465284, -0.0037154615856707096,...","[0.006386340595781803, 0.013193758204579353, -...",0.264954
996,997,Mini Double Chocolate Ice Cream Bars,37,1,"[-0.0070345643907785416, -0.005980468820780516...","[-0.00492378743365407, 0.012434347532689571, 0...",0.298593
997,998,Hot Chopped Green Chili,104,13,"[-0.006449181120842695, 0.0064157224260270596,...","[-0.007653192151337862, 0.026360996067523956, ...",0.214765
998,999,Original Organic Ville BBQ Sauce,5,13,"[0.001610441948287189, 0.019470669329166412, 0...","[-0.003057758789509535, 0.0393008217215538, 0....",0.194476


## Create recommender based on similarity

We'll create a function that takes an ingredient and returns the top results from the catalogue, ranked by cosine similarity.

In [14]:
from openai.embeddings_utils import cosine_similarity
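
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction rather than magnitude. The pure-Python sketch below is only an illustration of that formula; the function name `cosine` is made up here, and we are assuming the library's `cosine_similarity` helper computes the same standard quantity.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score close to 1, unrelated ones 0, opposite ones -1.
same_direction = cosine([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])
opposite = cosine([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Because only direction matters, a longer product name does not score higher simply for having a larger embedding norm.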

In [24]:
# search the product catalogue for the items most similar to an ingredient
def search_reviews(df, product_name, n=3, pprint=True):
    # embed the query with the query-side model that pairs with the
    # 'text-search-babbage-doc-001' embeddings stored in df.babbage_search
    embedding = get_embedding(
        product_name,
        engine="text-search-babbage-query-001"
    )
    # score every product against the query (this adds or overwrites a column on df)
    df["similarities"] = df.babbage_search.apply(lambda x: cosine_similarity(x, embedding))
    
    # sort by similarity and keep the top n results, i.e. the n best matches
    res = (
        df.sort_values("similarities", ascending=False)
        .head(n)
    )
    # pprint is currently unused: the caller prints the results
    return res

In [25]:
# Green beans seems to work ok
res = search_reviews(products_trimmed, "green beans", n=3)
for idx,row in res.iterrows():
    print(row['product_name'])

Steamfresh Chef's Favorites Lightly Sauced Roasted Red Potatoes & Green Beans
Garbanzo Beans
Petite Green Peas


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["similarities"] = df.babbage_search.apply(lambda x: cosine_similarity(x, embedding))

The warning above is pandas' `SettingWithCopyWarning`. It is raised because `products_trimmed` was created as a slice of `products`; creating it with `products[:1000].copy()` avoids the warning.


In [22]:
# So does chocolate chip cookies
res = search_reviews(products_trimmed, "chocolate chip cookies", n=3)
for idx,row in res.iterrows():
    print(row['product_name'])

Cookie Chips Crunchy Dark Chocolate Chocolate Chip Cookies
Chocolate Sandwich Cookies
Gluten Free All Natural Chocolate Chip Cookies
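
The searches above handle one ingredient at a time, while the use case starts from a whole list of ingredients. The sketch below shows one way the same ranking could be looped over a list. It is not the notebook's method: `toy_embed` is a hypothetical word-count stand-in for a real embedding model so that the example runs without API calls, and `match_ingredients`, the vocabulary, and the three-item catalogue are likewise made up for illustration.

```python
import math

def toy_embed(text, vocabulary):
    """Hypothetical stand-in for an embedding model: word counts over a fixed vocabulary."""
    words = text.lower().split()
    return [float(words.count(term)) for term in vocabulary]

def cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def match_ingredients(ingredients, catalogue, vocabulary, n=2):
    """Return {ingredient: [top-n product names]} ranked by cosine similarity."""
    # Embed the catalogue once, then reuse the vectors for every ingredient.
    product_vectors = {name: toy_embed(name, vocabulary) for name in catalogue}
    matches = {}
    for ingredient in ingredients:
        query = toy_embed(ingredient, vocabulary)
        ranked = sorted(
            catalogue,
            key=lambda name: cosine(query, product_vectors[name]),
            reverse=True,
        )
        matches[ingredient] = ranked[:n]
    return matches

catalogue = ["Garbanzo Beans", "Petite Green Peas", "Chocolate Sandwich Cookies"]
vocabulary = ["beans", "green", "peas", "chocolate", "cookies", "garbanzo", "sandwich", "petite"]
result = match_ingredients(["green beans", "chocolate chip cookies"], catalogue, vocabulary, n=1)
print(result)
```

The toy word-count vectors only match on shared words, so their rankings say nothing about how the real embedding models behave; the point is the structure of embedding the catalogue once and ranking it per ingredient.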
