# Instacart Matching

For this use case, we take a list of ingredients provided by a customer and search the Instacart catalogue for the best matches.

To accomplish this we will:
- Load a sample list of Instacart products
- Create embeddings based on these
- Take ingredient names and compare their embeddings with the product embeddings to find the most similar items to present

## Setup

First, we import the packages we'll need.

In [4]:
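# imports
# Note: openai.embeddings_utils and openai.Embedding (used below) belong to the pre-1.0 openai library
import openai  # OpenAI Python library to make API calls
import pandas as pd
import os
from openai.embeddings_utils import cosine_similarity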

# set API key
openai.api_key = os.environ.get("OPENAI_API_KEY")

# set data directory
data_dir = os.path.join(os.pardir,'data')

# Set to the text-embedding-ada-002 model; change this to the embedding model of your choice
EMBEDDING_MODEL = "text-embedding-ada-002"

## Load Data

Next, we'll define a helper for requesting embeddings and then load our list of Instacart products.

In [5]:
# Simple function that takes a single piece of text and returns its embedding as a list of floats
def get_embedding(text,engine):
    response = openai.Embedding.create(
        input=text,
        model=engine,
    )
    # The API returns one entry per input; we sent one input, so take the first
    return response['data'][0]["embedding"]
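
Calling the API once per product name is the slow part of this notebook. As a rough sketch, the helper below groups texts into batches so that each request could carry several inputs. `embed_in_batches` and `embed_batch` are hypothetical names, not part of the original notebook: `embed_batch` is any callable that takes a list of strings and returns one embedding per string, in order. With the pre-1.0 `openai` library it could wrap `openai.Embedding.create`, which accepts a list as `input`; that wiring is an assumption and is not shown here.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[Vector]],
    batch_size: int = 100,
) -> List[Vector]:
    """Embed texts in fixed-size batches, preserving input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    embeddings: List[Vector] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        vectors = embed_batch(batch)
        if len(vectors) != len(batch):
            raise ValueError("embed_batch must return one vector per input")
        embeddings.extend(vectors)
    return embeddings


# Stand-in embedder for illustration only: maps each string to [length, word count]
def fake_embed_batch(batch: List[str]) -> List[Vector]:
    return [[float(len(text)), float(len(text.split()))] for text in batch]


names = ["Chocolate Sandwich Cookies", "All-Seasons Salt", "Green Chile Anytime Sauce"]
vectors = embed_in_batches(names, fake_embed_batch, batch_size=2)
```

The stand-in embedder keeps the example runnable without an API key; only the batching logic is the point here.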

In [6]:
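# 49,688 Instacart products
# Note: this reads whichever file os.listdir returns first, so the products CSV should be the only file in data_dir
products = pd.read_csv(os.path.join(data_dir,os.listdir(data_dir)[0]))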
products.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13
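
`os.listdir` returns entries in arbitrary order, so `os.listdir(data_dir)[0]` only picks the products file reliably while it is the sole file in the directory, and this notebook later writes `products_with_embeddings.csv` into the same directory. A minimal sketch of a more explicit lookup follows. `find_source_csv` is a hypothetical helper, not part of the original notebook, and it assumes the directory holds exactly one source `.csv` file besides the excluded output file.

```python
import os
import tempfile


def find_source_csv(directory, exclude=("products_with_embeddings.csv",)):
    """Return the path of the only .csv file in directory, ignoring excluded names."""
    candidates = sorted(
        name for name in os.listdir(directory)
        if name.lower().endswith(".csv") and name not in exclude
    )
    if len(candidates) != 1:
        raise FileNotFoundError(
            f"expected exactly one source CSV in {directory!r}, found {candidates}"
        )
    return os.path.join(directory, candidates[0])


# Demonstration in a throwaway directory; "catalogue.csv" and "notes.txt" are made-up names
with tempfile.TemporaryDirectory() as tmp:
    for name in ("catalogue.csv", "products_with_embeddings.csv", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    chosen = os.path.basename(find_source_csv(tmp))
```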


## Create Embeddings



In [7]:
# Trimmed to the first 1,000 products
# Embedding the whole dataset takes a while, so we work with a subset here
# .copy() gives an independent DataFrame, so adding columns later does not trigger pandas' SettingWithCopyWarning
products_trimmed = products[:1000].copy()
products_trimmed.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13


In [8]:
# This will take just under 10 minutes
products_trimmed['product_embedding'] = products_trimmed.product_name.apply(lambda x: get_embedding(x, engine=EMBEDDING_MODEL))


In [9]:
# Save the products together with their embeddings so they don't have to be recomputed
products_trimmed.to_csv(os.path.join(data_dir,'products_with_embeddings.csv'))
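
A CSV file stores each embedding as the text of a Python list, so when the file is read back the `product_embedding` column holds strings rather than lists of floats. The sketch below shows one way to convert such a cell back. `parse_embedding` is a hypothetical helper, and applying it with `df.product_embedding.apply(parse_embedding)` after `pd.read_csv` is an assumption about how it would be wired in.

```python
import ast


def parse_embedding(cell):
    """Convert the string form of a list of numbers back into a list of floats."""
    values = ast.literal_eval(cell)  # safely evaluates a Python literal, unlike eval()
    return [float(v) for v in values]


stored = str([0.25, -0.5, 1.0])      # the text a list turns into when written to CSV
restored = parse_embedding(stored)
```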

## Create recommender based on similarity

We'll create a function that, given an ingredient, returns the top results ranked by cosine similarity.
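
For reference, the cosine similarity of two vectors is their dot product divided by the product of their lengths. The following is a dependency-free sketch of that formula and of a top-`n` ranking over a toy catalogue. The three-dimensional vectors are invented for illustration and are not real embeddings, and `cosine` and `top_matches` are hypothetical names rather than the library's implementation.

```python
import math


def cosine(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_matches(query, catalogue, n=3):
    """Return the n (name, score) pairs with the highest cosine similarity to query."""
    scored = [(name, cosine(query, vec)) for name, vec in catalogue.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


# Toy vectors, invented for illustration
toy_catalogue = {
    "item_a": [1.0, 0.0, 0.0],
    "item_b": [0.6, 0.8, 0.0],
    "item_c": [0.0, 0.0, 1.0],
}
ranked = top_matches([1.0, 0.0, 0.0], toy_catalogue, n=2)
```

A score of 1 means the vectors point in the same direction, and 0 means they are orthogonal, which is why sorting in descending order puts the best matches first.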

In [10]:
# search the product catalogue for the closest matches to a given ingredient
def search_reviews(df, product_name, n=3, pprint=True):
    # pprint is currently unused; results are printed by the calling cells
    embedding = get_embedding(
        product_name,
        engine=EMBEDDING_MODEL
    )
    # Score every product by cosine similarity to the query embedding
    df["similarities"] = df.product_embedding.apply(lambda x: cosine_similarity(x, embedding))

    # Sort by similarity and take the top "n" results, i.e. the n best matches
    res = (
        df.sort_values("similarities", ascending=False)
        .head(n)
    )
    return res

In [11]:
# Green beans seems to work ok
res = search_reviews(products_trimmed, "green beans", n=3)
for idx,row in res.iterrows():
    print(row['product_name'])

Garbanzo Beans
Petite Green Peas
Petite Brussels Sprouts




In [14]:
# Chocolate chip cookies returns close matches
res = search_reviews(products_trimmed, "chocolate chip cookies", n=3)
for idx,row in res.iterrows():
    print(row['product_name'])

Cookie Chips Crunchy Dark Chocolate Chocolate Chip Cookies
Chocolate Sandwich Cookies
Gluten Free All Natural Chocolate Chip Cookies
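
The cells above query one ingredient at a time. To handle a customer's whole ingredient list, the same search could be run once per ingredient, as in this sketch. `match_ingredients` and the stand-in `fake_search` are hypothetical; in this notebook, `search_fn` could plausibly be a small wrapper that calls `search_reviews(products_trimmed, name, n=n)` and returns the `product_name` column as a list.

```python
def match_ingredients(ingredients, search_fn, n=3):
    """Map each ingredient to the top-n product names returned by search_fn."""
    return {ingredient: search_fn(ingredient, n) for ingredient in ingredients}


# Stand-in search for illustration: word matching over a few product names
SAMPLE_PRODUCTS = ["Garbanzo Beans", "Petite Green Peas", "Chocolate Sandwich Cookies"]


def fake_search(name, n):
    words = name.lower().split()
    hits = [p for p in SAMPLE_PRODUCTS if any(w in p.lower() for w in words)]
    return hits[:n]


matches = match_ingredients(["green beans", "cookies"], fake_search, n=2)
for ingredient, products_found in matches.items():
    print(ingredient, "->", products_found)
```

Passing the search as a parameter keeps the loop independent of how similarity is computed, so the embedding-based search can be swapped in without changing it.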


