# Video: Implementing Document Search with Language Model Embeddings

Language model embeddings are very useful keys for document search, and generally provide better search than the document vectors previously used in week 10.
In this video, we use language model embeddings to find similar recipes and to search for recipes given a description.

Script: (faculty on screen)
* Earlier in this module, we used simple document vectors based on vocabulary analysis to compare documents.
* Language model embeddings provide much better document vectors than those vocabulary-based vectors.
* In this video, we will use Google's Gemini embedding model to find similar recipes and search for recipes based on a query.

In [None]:
%pip install -q google-genai

In [None]:
import google.genai as genai
from google.genai import types
from google.colab import userdata
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [None]:
client = genai.Client(api_key=userdata.get('GEMINI_API_KEY'))

In [None]:
embedding_model_name = 'gemini-embedding-001'

In [None]:
recipes = pd.read_csv("https://raw.githubusercontent.com/bu-cds-omds/dx704-examples/refs/heads/main/data/recipes.tsv.gz", sep="\t")
recipes = recipes.set_index("recipe_slug")
recipes = recipes[:1000]

Script:
* I've loaded the recipe data from Bacon Powered Recipes again.
* I limited the data set to the first 1000 recipes so we won't have to wait too long to query all their document vectors.

In [None]:
recipe_embeddings = {}

Script:
* As usual, we will cache the document vectors for easy access.
* For this video, I will just store them in a dictionary to make it maximally clear what data we are handling.
* A serious implementation would use a vector database for fast nearest neighbor queries instead of scanning all the vectors each time.
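As a stepping stone between the dictionary and a vector database, the scan itself can be done in one vectorized NumPy operation. The sketch below is still a brute-force scan over every vector, not an approximate nearest neighbor index. The helper names `build_index` and `nearest` and the toy 2-D vectors are made up for illustration.

```python
import numpy as np

def build_index(embeddings):
    """Stack a {slug: vector} dictionary into a list of slugs and a 2-D matrix."""
    slugs = list(embeddings.keys())
    matrix = np.stack([embeddings[s] for s in slugs])
    return slugs, matrix

def nearest(slugs, matrix, query_vector, k=1):
    """Return the k slugs whose rows are closest to query_vector (Euclidean)."""
    # One distance per row of the matrix, computed without a Python loop.
    distances = np.linalg.norm(matrix - query_vector, axis=1)
    order = np.argsort(distances)[:k]
    return [slugs[i] for i in order]

# Toy 2-D vectors standing in for real embeddings (illustrative values only).
toy_embeddings = {
    "a": np.array([0.0, 0.0]),
    "b": np.array([1.0, 0.0]),
    "c": np.array([5.0, 5.0]),
}
toy_slugs, toy_matrix = build_index(toy_embeddings)
print(nearest(toy_slugs, toy_matrix, np.array([0.9, 0.1]), k=2))  # prints ['b', 'a']
```

With the real cache, `build_index(recipe_embeddings)` would be called once after all embeddings are saved, and the matrix reused for every query.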

In [None]:
def get_embedding(text):
    response = client.models.embed_content(model=embedding_model_name,
                                           contents=text)
    return np.array(response.embeddings[0].values)

Script:
* This get_embedding function will query the API for a particular piece of text.
* I separated it out for this video so we can calculate embeddings for queries and not just recipes.
* Google has options to embed queries and documents differently, but I am not using them.
* That option is Google-specific at the moment; most providers just have one embedding option.
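For reference, here is a rough sketch of what using that option might look like. It assumes the SDK accepts a `task_type` entry in the `config` argument of `embed_content`, with values such as `"RETRIEVAL_DOCUMENT"` for stored recipes and `"RETRIEVAL_QUERY"` for search queries. Check the current Gemini documentation before relying on those names. The helper `get_embedding_for_task` is hypothetical and takes the client as a parameter so it can be tried with a stand-in object.

```python
import numpy as np

def get_embedding_for_task(client, model_name, text, task_type):
    """Embed text with an explicit task type (hypothetical helper, not used in the video)."""
    response = client.models.embed_content(
        model=model_name,
        contents=text,
        # Assumption: task_type is passed through the config argument,
        # e.g. "RETRIEVAL_DOCUMENT" for recipes and "RETRIEVAL_QUERY" for queries.
        config={"task_type": task_type},
    )
    return np.array(response.embeddings[0].values)
```

If this option is used, the recipes and the queries would each need their own task type, and any cached recipe vectors would have to be recomputed with the document setting.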

In [None]:
def save_recipe_embedding(recipe_tuple):
    recipe_slug = recipe_tuple.Index
    if recipe_slug in recipe_embeddings:
        return

    embedding = get_embedding(recipe_tuple.recipe_introduction)
    recipe_embeddings[recipe_slug] = embedding

Script:
* This function saves an embedding for each recipe based on its introduction if it has not already been saved.

In [None]:
for r in recipes.itertuples():
    save_recipe_embedding(r)

Script:
* And here, all the recipes have their embeddings saved.
* I ran it beforehand, but it takes about 3 minutes for a thousand calls in a row.
* For the moment, I do not recommend trying to make multiple API calls at once, as I've seen more API busy messages than I'd like while prepping these videos.
* Let's start comparing recipes with these vectors now.
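One way to cope with occasional busy messages during the embedding loop is to retry a failed call after a growing pause. The sketch below is generic and not part of the video's notebook. `call_with_retries` is a hypothetical helper, and it catches every `Exception` for simplicity, where real code would more likely catch only the SDK's specific error type.

```python
import time

def call_with_retries(func, *args, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(*args), retrying with exponential backoff when it raises."""
    for attempt in range(max_attempts):
        try:
            return func(*args)
        except Exception:
            # Re-raise after the final attempt; otherwise wait
            # base_delay, 2 * base_delay, 4 * base_delay, ... seconds.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

In the notebook, this could wrap the API call as `call_with_retries(get_embedding, recipe_tuple.recipe_introduction)`.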

In [None]:
def closest_recipe(recipe_slug):
    return min((k for k in recipe_embeddings if k != recipe_slug),
               key=lambda x: np.linalg.norm(recipe_embeddings[x] - recipe_embeddings[recipe_slug]))

closest_recipe('apple-crisp')

'apple-crumble'

Script:
* This function takes in a recipe slug, looks up its embedding vector, and then finds the recipe with the closest vector.
* The first time I tested this, I omitted the check for the same recipe, and of course, it told me that the closest recipe to apple crisp is apple crisp.
* A vector will always be zero distance to itself.
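The effect of that self-check is easy to see on a toy example. The three 2-D vectors below are made up for illustration and are not real embeddings. The `closest` helper and its `exclude_self` flag are hypothetical and exist only to show both behaviors side by side.

```python
import numpy as np

# Made-up 2-D vectors standing in for real recipe embeddings.
toy = {
    "apple-crisp": np.array([1.0, 0.0]),
    "apple-crumble": np.array([0.9, 0.1]),
    "chowder": np.array([0.0, 1.0]),
}

def closest(slug, embeddings, exclude_self=True):
    """Return the key whose vector is nearest to embeddings[slug]."""
    target = embeddings[slug]
    candidates = [k for k in embeddings if not (exclude_self and k == slug)]
    return min(candidates, key=lambda k: np.linalg.norm(embeddings[k] - target))

print(closest("apple-crisp", toy, exclude_self=False))  # prints apple-crisp (distance 0 to itself)
print(closest("apple-crisp", toy))                      # prints apple-crumble
```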

In [None]:
closest_recipe('roasted-pear-and-butternut-squash-soup')

'butternut-squash-risotto'

Script:
* These sound pretty similar.
* Let's check some more.

In [None]:
for recipe_slug in recipes.index[:10]:
    print(recipe_slug, closest_recipe(recipe_slug))

spiced-pear-and-walnut-salad walnut-and-cranberry-salad
roasted-pear-and-butternut-squash-soup butternut-squash-risotto
peach-clafoutis raspberry-clafoutis
plum-clafoutis peach-clafoutis
pear olives
pear-and-gingerbread-trifle coconut-caramel-trifle
chicken-pot-pie vegetable-pot-pie
doritos-loaded-baked-potatoes doritos-loaded-potato-skins
pear-and-gorgonzola-tart asparagus-and-goat-cheese-tart
asparagus-and-goat-cheese-tart zucchini-and-goat-cheese-quiche


Script:
* These all look like very similar pairs to me.
* The most dissimilar pair is probably pear and olives.
* I imagine that some of these could be found by looking at just the vocabulary, but even the different words feel related, like soup and risotto.
* Let's try implementing search now.

In [None]:
def search_recipe(query):
    query_embedding = get_embedding(query)
    return min(recipe_embeddings,
               key=lambda x: np.linalg.norm(recipe_embeddings[x] - query_embedding))

search_recipe("crunchy sweet apple dish comfort food")

'apple-crisp'

Script:
* This function takes in a query, gets its embedding, and then finds the recipe with the closest vector.
* I had apple crisp in mind when I wrote this query, and got it on the first try.

In [None]:
search_recipe("easy soup for cold day")

'seafood-chowder'

Script:
* Chowder is not what I had in mind, but I guess it works.
* Let's rewrite that function to give multiple results.

In [None]:
def search_recipes(query):
    query_embedding = get_embedding(query)
    candidates = list(recipe_embeddings.keys())
    candidates.sort(key=lambda x: np.linalg.norm(recipe_embeddings[x] - query_embedding))
    return candidates[:10]

search_recipes("easy soup for cold day")

['seafood-chowder',
 'kosher-matzo-ball-soup',
 'classic-sujebi-soup',
 'matzo-ball-soup',
 'chicken-tteokguk',
 'sujebi',
 'french-onion-soup',
 'samgyetang-ginseng-chicken-soup',
 'chicken-salad',
 'kimchi-sujebi-stew']

Script:
* This version returns the ten recipes with vectors closest to the query vector.
* Half of these recipes are Korean soups and would be perfect for a cold day.
* I'm not sure Samgyetang counts as an easy soup, and chicken salad is definitely not a soup, but overall these look good.
* And those two that I called out are numbers 8 and 9, so they aren't the top picks either.
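To judge how far down the list a weaker match sits, it can help to return the distances along with the slugs. The following is a minimal sketch, assuming the query has already been embedded. `rank_with_distances` is a hypothetical helper, and the toy vectors are illustrative rather than real embeddings.

```python
import heapq
import numpy as np

def rank_with_distances(query_vector, embeddings, k=10):
    """Return the k closest (slug, distance) pairs, nearest first."""
    scored = (
        (slug, float(np.linalg.norm(vector - query_vector)))
        for slug, vector in embeddings.items()
    )
    # nsmallest keeps only the k best pairs instead of sorting every candidate.
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])

# Made-up 2-D vectors standing in for real recipe embeddings.
toy = {
    "soup": np.array([1.0, 0.0]),
    "stew": np.array([0.8, 0.6]),
    "salad": np.array([0.0, 1.0]),
}
for slug, distance in rank_with_distances(np.array([1.0, 0.0]), toy, k=2):
    print(slug, round(distance, 3))
```

With the real cache, the call would look like `rank_with_distances(get_embedding(query), recipe_embeddings)`. A large jump in distance between neighboring results is one hint that the later ones are weaker matches.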

Script: (faculty on screen)
* Language model embeddings were a big step up in search capability, and they made good document search much easier to implement.
