# Long-Context ColBERT

Let's build a Vespa application with Long-Context ColBERT, following [this example](https://pyvespa.readthedocs.io/en/latest/examples/chat_with_your_pdfs_using_colbert_langchain_and_Vespa-cloud.html).

In [29]:
# verify if the gpu is available
import torch

print(torch.cuda.is_available())

torch.device('cuda' if torch.cuda.is_available() else 'cpu')

True


device(type='cuda')

## Imports

First, let's import the libraries needed to build Vespa application packages.

In [30]:
from vespa.package import (
    ApplicationPackage,
    Component,
    Parameter,
    Field,
    HNSW,
    RankProfile,
    Function,
    FirstPhaseRanking,
    SecondPhaseRanking,
    FieldSet,
    DocumentSummary,
    Summary,
)
from pathlib import Path
import json
import pandas as pd
import ast
import numpy as np

from vespa.package import Schema, Document, Field, FieldSet

Let's check that the Vespa CLI is installed:

In [31]:
!vespa --version

Usage:
  vespa [flags]
  vespa [command]

Available Commands:
  activate    Activate (deploy) a previously prepared application package
  auth        Manage Vespa Cloud credentials
  clone       Create files and directory structure from a Vespa sample application
  completion  Generate the autocompletion script for the specified shell
  config      Manage persistent values for global flags
  curl        Access Vespa directly using curl
  deploy      Deploy (prepare and activate) an application package
  destroy     Remove a deployed Vespa application and its data
  document    Issue a single document operation to Vespa
  feed        Feed multiple document operations to Vespa
  fetch       Download a deployed application package
  help        Help about any command
  log         Show the Vespa log
  prepare     Prepare an application package for activation
  prod        Deploy an application package to production in Vespa Cloud
  query       Issue a query to Vespa
  status      Show Ves

## Creating the Vespa application

Let's create the Vespa application package with two embedder components, `e5` and `colbert`:

In [32]:
from vespa.package import ApplicationPackage, Component, Parameter

vespa_app_name = "findmypasta"
app_package = ApplicationPackage(
    name=vespa_app_name,
    components=[
        Component(
            id="e5",
            type="hugging-face-embedder",
            parameters=[
                Parameter(
                    name="transformer-model",
                    args={
                        "url": "https://huggingface.co/intfloat/e5-small-v2/resolve/main/model.onnx"
                    },
                ),
                Parameter(
                    name="tokenizer-model",
                    args={
                        "url": "https://huggingface.co/intfloat/e5-small-v2/raw/main/tokenizer.json"
                    },
                ),
            ],
        ),
        Component(
            id="colbert",
            type="colbert-embedder",
            parameters=[
                Parameter(
                    name="transformer-model",
                    args={
                        "url": "https://huggingface.co/colbert-ir/colbertv2.0/resolve/main/model.onnx"
                    },
                ),
                Parameter(
                    name="tokenizer-model",
                    args={
                        "url": "https://huggingface.co/colbert-ir/colbertv2.0/raw/main/tokenizer.json"
                    },
                ),
            ],
        ),
    ],
)

Let's create the *Schema* with the fields of our recipe, plus the `embedding_*` and `colbert_*` tensor fields:

In [33]:
app_package.schema.add_fields(
    Field(name="id", type="int", indexing=["attribute", "summary"]),
    Field(
        name="title", type="string", indexing=["index", "summary"], index="enable-bm25"
    ),
    # Field(
    #     name="description", type="string", indexing=["index", "summary"], index="enable-bm25"
    # ),
    # Field(
    #     name="minutes",
    #     type="string",
    #     indexing=["summary"],
    # ),
    # Field(
    #     name="n_steps",
    #     type="string",
    #     indexing=["attribute", "summary"],
    # ),
    # Field(
    #     name="n_ingredients",
    #     type="string",
    #     indexing=["attribute", "summary"],
    # ),
    # Field(
    #     name="submitted",
    #     type="string",
    #     indexing=["attribute", "summary"],
    # ),
    Field(
        name="body",
        type="string", 
        indexing=["index", "summary"],
        index="enable-bm25",
        bolding=True
    ),
    Field(
        name = "body_split",
        type = "array<string>",
        indexing = ["index", "summary"],
        index = "enable-bm25",
        bolding = True,
    ),
    # Field(
    #     name="tags",
    #     type="array<string>",
    #     indexing=["index", "summary"],
    #     index="enable-bm25",
    #     bolding=True,
    # ),
    Field(
        name="steps",
        type="array<string>",
        indexing=["index", "summary"],
        index="enable-bm25",
        bolding=True,
    ),
    Field(
        name="ingredients",
        type="array<string>",
        indexing=["index", "summary"],
        index="enable-bm25",
        bolding=True,
    ),
    #Field(
    #    name="colbert",
    #    type="tensor<int8>(token{},v[16])",
    #    indexing=["attribute", "summary", "index"],
    #    attribute=["distance-metric:hamming"],
    #)
    Field(
    name="embedding_body_split",
    type="tensor<bfloat16>(body_split{}, x[384])",
    indexing=[
        "input body_split",
        "embed e5",
        "attribute",
    ],
    attribute=["distance-metric: angular"],
    is_document_field=False,
    ),
    Field(
    name="colbert_body_split",
    type="tensor<int8>(body_split{}, token{}, v[16])",
    indexing=["input body_split", "embed colbert body_split", "attribute"],
    is_document_field=False,
    ),
    Field(
    name="embedding_steps",
    type="tensor<bfloat16>(steps{}, x[384])",
    indexing=[
        "input steps",
        "embed e5",
        "attribute",
    ],
    attribute=["distance-metric: angular"],
    is_document_field=False,
    ),
    Field(
    name="colbert_steps",
    type="tensor<int8>(steps{}, token{}, v[16])",
    indexing=["input steps", "embed colbert steps", "attribute"],
    is_document_field=False,
    ),
    Field(
    name="embedding_ingredients",
    type="tensor<bfloat16>(ingredients{}, x[384])",
    indexing=[
        "input ingredients",
        "embed e5",
        "attribute",
    ],
    attribute=["distance-metric: angular"],
    is_document_field=False,
    ),
    Field(
    name="colbert_ingredients",
    type="tensor<int8>(ingredients{}, token{}, v[16])",
    indexing=["input ingredients", "embed colbert ingredients", "attribute"],
    is_document_field=False,
    )
)

# add fieldset
app_package.schema.add_field_set(
    FieldSet(
        name="default",
        fields=["title", "body", "body_split", "steps", "ingredients"]
    )
)
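
The `colbert_*` fields above store each token vector as 16 `int8` values (`v[16]`), while the rank profiles declare the query tensor `query(qt)` with 128 float dimensions (`v[128]`) and apply `unpack_bits` to the document side. The following NumPy sketch illustrates why those two sizes line up, assuming the token vectors are sign-binarized and packed 8 bits per byte. It is an illustration only, not Vespa's implementation, and the helper names `pack_token_vector` and `unpack_token_vector` are hypothetical.

```python
import numpy as np


def pack_token_vector(vec):
    """Binarize a float vector (positive -> 1, otherwise 0) and pack 8 bits per byte."""
    bits = (np.asarray(vec) > 0).astype(np.uint8)
    # np.packbits returns uint8; reinterpret the same bytes as int8
    return np.packbits(bits).view(np.int8)


def unpack_token_vector(packed):
    """Expand packed int8 bytes back into a vector of 0.0 / 1.0 values."""
    as_bytes = np.asarray(packed, dtype=np.int8).view(np.uint8)
    return np.unpackbits(as_bytes).astype(np.float32)


# A 128-dimensional token vector becomes 16 bytes, and unpacks to 128 values again.
rng = np.random.default_rng(0)
token_vector = rng.standard_normal(128)
packed = pack_token_vector(token_vector)
unpacked = unpack_token_vector(packed)
```

Under this assumption, a 128-dimensional float vector occupies 16 bytes instead of 512, at the cost of keeping only the sign of each dimension.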
    

Let's create the *RankProfile*s using [context-level ColBERT](https://blog.vespa.ai/announcing-long-context-colbert-in-vespa/):

In [34]:
colbert_max_body_split = RankProfile(
    name="colbert_max_body_split",
    inputs=[
        ("query(q)", "tensor<float>(x[384])"),
        ("query(qt)", "tensor<float>(querytoken{}, v[128])"),
    ],
    functions=[
        Function(name="cos_sim_body_split", expression="closeness(field, embedding_body_split)"),
        Function(
            name="max_sim_per_context_body_split",
            expression="""
                sum(
                    reduce(
                        sum(
                            query(qt) * unpack_bits(attribute(colbert_body_split)), v
                        ),
                        max, token
                    ),
                    querytoken
                )
            """,
        ),
        Function(
            name="max_sim_body_split", expression="reduce(max_sim_per_context_body_split, max, body_split)"
            
        ),
    ],
    first_phase=FirstPhaseRanking(expression="cos_sim_body_split"),
    second_phase=SecondPhaseRanking(expression="max_sim_body_split"),
    match_features=["cos_sim_body_split", "max_sim_body_split", "max_sim_per_context_body_split"],
)

colbert_avg_body_split = RankProfile(
    name="colbert_avg_body_split",
    inputs=[
        ("query(q)", "tensor<float>(x[384])"),
        ("query(qt)", "tensor<float>(querytoken{}, v[128])"),
    ],
    functions=[
        Function(name="cos_sim_body_split", expression="closeness(field, embedding_body_split)"),
        Function(
            name="max_sim_per_context_body_split",
            expression="""
                sum(
                    reduce(
                        sum(
                            query(qt) * unpack_bits(attribute(colbert_body_split)), v
                        ),
                        max, token
                    ),
                    querytoken
                )
            """,
        ),
        Function(
            name="avg_sim_body_split", expression="reduce(max_sim_per_context_body_split, avg, body_split)"
        ),
    ],
    first_phase=FirstPhaseRanking(expression="cos_sim_body_split"),
    second_phase=SecondPhaseRanking(expression="avg_sim_body_split"),
    match_features=["cos_sim_body_split", "avg_sim_body_split", "max_sim_per_context_body_split"],
)

colbert_max_steps= RankProfile(
    name="colbert_max_steps",
    inputs=[
        ("query(q)", "tensor<float>(x[384])"),
        ("query(qt)", "tensor<float>(querytoken{}, v[128])"),
    ],
    functions=[
        Function(name="cos_sim_steps", expression="closeness(field, embedding_steps)"),
        Function(
            name="max_sim_per_context_steps",
            expression="""
                sum(
                    reduce(
                        sum(
                            query(qt) * unpack_bits(attribute(colbert_steps)), v
                        ),
                        max, token
                    ),
                    querytoken
                )
            """,
        ),
        Function(
            name="max_sim_steps", expression="reduce(max_sim_per_context_steps, max, steps)"
            
        ),
    ],
    first_phase=FirstPhaseRanking(expression="cos_sim_steps"),
    second_phase=SecondPhaseRanking(expression="max_sim_steps"),
    match_features=["cos_sim_steps", "max_sim_steps", "max_sim_per_context_steps"],
)

colbert_avg_steps = RankProfile(
    name="colbert_avg_steps",
    inputs=[
        ("query(q)", "tensor<float>(x[384])"),
        ("query(qt)", "tensor<float>(querytoken{}, v[128])"),
    ],
    functions=[
        Function(name="cos_sim_steps", expression="closeness(field, embedding_steps)"),
        Function(
            name="max_sim_per_context_steps",
            expression="""
                sum(
                    reduce(
                        sum(
                            query(qt) * unpack_bits(attribute(colbert_steps)), v
                        ),
                        max, token
                    ),
                    querytoken
                )
            """,
        ),
        Function(
            name="avg_sim_steps", expression="reduce(max_sim_per_context_steps, avg, steps)"
        ),
    ],
    first_phase=FirstPhaseRanking(expression="cos_sim_steps"),
    second_phase=SecondPhaseRanking(expression="avg_sim_steps"),
    match_features=["cos_sim_steps", "avg_sim_steps", "max_sim_per_context_steps"],
)

colbert_max_ingredients = RankProfile(
    name="colbert_max_ingredients",
    inputs = [
        ("query(q)", "tensor<float>(x[384])"),
        ("query(qt)", "tensor<float>(querytoken{}, v[128])"),
    ],
    functions=[
        Function(name="cos_sim_ingredients", expression="closeness(field, embedding_ingredients)"),
        Function(
            name="max_sim_per_context_ingredients",
            expression="""
                sum(
                    reduce(
                        sum(
                            query(qt) * unpack_bits(attribute(colbert_ingredients)), v
                        ),
                        max, token
                    ),
                    querytoken
                )
            """,
        ),
        Function(
            name="max_sim_ingredients", expression="reduce(max_sim_per_context_ingredients, max, ingredients)"
        ),
    ],
    first_phase=FirstPhaseRanking(expression="cos_sim_ingredients"),
    second_phase=SecondPhaseRanking(expression="max_sim_ingredients"),
    match_features=["cos_sim_ingredients", "max_sim_ingredients", "max_sim_per_context_ingredients"],
)

colbert_avg_ingredients = RankProfile(
    name="colbert_avg_ingredients",
    inputs = [
        ("query(q)", "tensor<float>(x[384])"),
        ("query(qt)", "tensor<float>(querytoken{}, v[128])"),
    ],
    functions=[
        Function(name="cos_sim_ingredients", expression="closeness(field, embedding_ingredients)"),
        Function(
            name="max_sim_per_context_ingredients",
            expression="""
                sum(
                    reduce(
                        sum(
                            query(qt) * unpack_bits(attribute(colbert_ingredients)), v
                        ),
                        max, token
                    ),
                    querytoken
                )
            """,
        ),
        Function(
            name="avg_sim_ingredients", expression="reduce(max_sim_per_context_ingredients, avg, ingredients)"
        ),
    ],
    first_phase=FirstPhaseRanking(expression="cos_sim_ingredients"),
    second_phase=SecondPhaseRanking(expression="avg_sim_ingredients"),
    match_features=["cos_sim_ingredients", "avg_sim_ingredients", "max_sim_per_context_ingredients"],
)

colbert_max_body_split_bm25 = RankProfile(
    name="colbert_max_body_split_bm25",
    inherits="colbert_max_body_split",
    first_phase=FirstPhaseRanking(expression="cos_sim_body_split"),
    second_phase=SecondPhaseRanking(expression="max_sim_body_split + bm25(title)"),
    match_features=["cos_sim_body_split", "max_sim_body_split", "max_sim_per_context_body_split"],
)

colbert_avg_body_split_bm25 = RankProfile(
    name="colbert_avg_body_split_bm25",
    inherits="colbert_avg_body_split",
    first_phase=FirstPhaseRanking(expression="cos_sim_body_split"),
    second_phase=SecondPhaseRanking(expression="avg_sim_body_split + bm25(title)"),
    match_features=["cos_sim_body_split", "avg_sim_body_split", "max_sim_per_context_body_split"],
)

colbert_max_steps_bm25 = RankProfile(
    name="colbert_max_steps_bm25",
    inherits="colbert_max_steps",
    first_phase=FirstPhaseRanking(expression="cos_sim_steps"),
    second_phase=SecondPhaseRanking(expression="max_sim_steps + bm25(title)"),
    match_features=["cos_sim_steps", "max_sim_steps", "max_sim_per_context_steps"],
)

colbert_avg_steps_bm25 = RankProfile(
    name="colbert_avg_steps_bm25",
    inherits="colbert_avg_steps",
    first_phase=FirstPhaseRanking(expression="cos_sim_steps"),
    second_phase=SecondPhaseRanking(expression="avg_sim_steps + bm25(title)"),
    match_features=["cos_sim_steps", "avg_sim_steps", "max_sim_per_context_steps"],
)

colbert_max_ingredients_bm25 = RankProfile(
    name="colbert_max_ingredients_bm25",
    inherits="colbert_max_ingredients",
    first_phase=FirstPhaseRanking(expression="cos_sim_ingredients"),
    second_phase=SecondPhaseRanking(expression="max_sim_ingredients + bm25(title)"),
    match_features=["cos_sim_ingredients", "max_sim_ingredients", "max_sim_per_context_ingredients"],
)

colbert_avg_ingredients_bm25 = RankProfile(
    name="colbert_avg_ingredients_bm25",
    inherits="colbert_avg_ingredients",
    first_phase=FirstPhaseRanking(expression="cos_sim_ingredients"),
    second_phase=SecondPhaseRanking(expression="avg_sim_ingredients + bm25(title)"),
    match_features=["cos_sim_ingredients", "avg_sim_ingredients", "max_sim_per_context_ingredients"],
)

colbert_cross_body_split = RankProfile(
    name="colbert_cross_body_split",
    inputs=[
        ("query(q)", "tensor<float>(x[384])"),
        ("query(qt)", "tensor<float>(querytoken{}, v[128])"),
    ],
    functions=[
        Function(name="cos_sim_body_split", expression="closeness(field, embedding_body_split)"),
        Function(
            name="cross_max_sim_body_split",
            expression="""
                sum(
                    reduce(
                        sum(
                            query(qt) * unpack_bits(attribute(colbert_body_split)), v
                        ),
                        max, token, body_split
                    ),
                    querytoken
                )
            """,
        ),
        Function(
            name="max_cross_sim_body_split", expression="reduce(cross_max_sim_body_split, max)"
            
        ),
    ],
    first_phase=FirstPhaseRanking(expression="cos_sim_body_split"),
    second_phase=SecondPhaseRanking(expression="max_cross_sim_body_split"),
    match_features=["cos_sim_body_split", "max_cross_sim_body_split", "cross_max_sim_body_split"],
)
      

app_package.schema.add_rank_profile(colbert_max_body_split)
app_package.schema.add_rank_profile(colbert_avg_body_split)
app_package.schema.add_rank_profile(colbert_max_steps)
app_package.schema.add_rank_profile(colbert_avg_steps)
app_package.schema.add_rank_profile(colbert_max_ingredients)
app_package.schema.add_rank_profile(colbert_avg_ingredients)
app_package.schema.add_rank_profile(colbert_max_body_split_bm25)
app_package.schema.add_rank_profile(colbert_avg_body_split_bm25)
app_package.schema.add_rank_profile(colbert_max_steps_bm25)
app_package.schema.add_rank_profile(colbert_avg_steps_bm25)
app_package.schema.add_rank_profile(colbert_max_ingredients_bm25)
app_package.schema.add_rank_profile(colbert_avg_ingredients_bm25)
app_package.schema.add_rank_profile(colbert_cross_body_split)
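
The rank profiles above all share one scoring pattern: for each context (one element of `body_split`, `steps`, or `ingredients`), take the dot product of every query token with every document token, keep the maximum over document tokens, and sum over query tokens. The `max` and `avg` profiles then reduce those per-context scores across contexts, while the `cross` profile takes the maximum over the tokens of all contexts at once. The NumPy sketch below mirrors that logic on plain float arrays. It skips the bit unpacking, and the function names are hypothetical, so treat it as a way to reason about the expressions rather than as what Vespa executes.

```python
import numpy as np


def max_sim_per_context(query_tokens, context_tokens):
    """Sum over query tokens of the best dot product with any context token.

    query_tokens:   array of shape (num_query_tokens, dim)
    context_tokens: array of shape (num_context_tokens, dim)
    """
    sims = query_tokens @ context_tokens.T  # (num_query_tokens, num_context_tokens)
    return float(sims.max(axis=1).sum())


def score_document(query_tokens, contexts):
    """Aggregate per-context MaxSim scores the way the three profile families do."""
    per_context = [max_sim_per_context(query_tokens, c) for c in contexts]
    all_tokens = np.vstack(contexts)  # cross-context: pool the tokens of every context
    return {
        "max": max(per_context),
        "avg": sum(per_context) / len(per_context),
        "cross": max_sim_per_context(query_tokens, all_tokens),
    }


# Toy example: two query tokens, two contexts of different lengths.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
contexts = [
    np.array([[1.0, 0.0], [0.0, 1.0]]),
    np.array([[0.0, 1.0]]),
]
scores = score_document(query, contexts)
```

Because the cross-context variant lets each query token pick its best match from any context, its score is never lower than the `max` variant for the same document.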

In [35]:

#Path("pkg").mkdir(parents=True, exist_ok=True)
#app_package.to_files("pkg")

In [36]:

#! mkdir -p pkg/model
#! curl -L -o pkg/model/tokenizer.json \
#  https://huggingface.co/colbert-ir/colbertv2.0/raw/main/tokenizer.json

#! curl -L -o pkg/model/model.onnx \
#  https://huggingface.co/colbert-ir/colbertv2.0/resolve/main/model.onnx

Let's deploy the Vespa application package with Docker:

In [37]:
from vespa.deployment import VespaDocker

#vespa_docker = VespaDocker()
#app = vespa_docker.deploy_from_disk(application_name="findmypasta", application_root="pkg")

vespa_docker = VespaDocker()
app = vespa_docker.deploy(application_package=app_package)


Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for configuration server, 10/300 seconds...
Waiting for configuration server, 15/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 0/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 5/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 10/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 15/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 20/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 25/300 seconds...
Using plain http against endpoint http://localhost:8

## Feeding data

Let's write a function that builds the `body` field:

In [38]:
def recipe_file_body_lines(recipe, complementary_data = None):
    """
    Function responsible for creating the recipe body.
    """
    # Convert the string columns into lists
    recipe['tags'] = recipe['tags'].strip("[]").replace("'", "").split(', ')
    recipe['steps'] = recipe['steps'].strip("[]").replace("'", "").split(', ')
    recipe['ingredients'] = recipe['ingredients'].strip("[]").replace("'", "").split(', ')

    # reviews = complementary_data[complementary_data['recipe_id'] == recipe['id']]

    # # ordering by descending date
    # reviews = reviews.sort_values('date', ascending=False)

    # # getting the average rating
    # avg_rating = reviews['rating'].mean()

    # # if the average rating is NaN, we will set it to "No reviews"
    # if np.isnan(avg_rating):
    #     avg_rating = "No reviews"

    # creating the recipe body
    recipe_body = recipe['name'] + '\n' \
    + "Recipe posted on: " + str(recipe['submitted']) + '\n' \
    + "Tags: " + ', '.join(recipe['tags']) + '\n' \
    + "Description: " + recipe['description'] + '\n' \
    + "This recipe takes " + str(recipe['minutes']) + " minutes to be done." + '\n' \
    + "For this recipe you will need the ingredients: " + '\n' \
    + ', '.join(recipe['ingredients']) + '\n' \
    + "The " + str(recipe["n_steps"]) + " steps to make this recipe are: " + '\n' \
    + ', '.join(recipe['steps']) 
    return recipe_body

In [39]:
# Helper that applies recipe_file_body_lines to each row of the recipes DataFrame
def apply_recipe_file_body_lines(recipe_row):
    return recipe_file_body_lines(recipe_row)
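
The `tags`, `steps`, and `ingredients` columns hold Python-style list literals stored as strings. `recipe_file_body_lines` converts them with `strip`, `replace`, and `split(', ')`, which also splits inside items that contain a comma followed by a space and removes apostrophes from the text. The short comparison below, using a made-up two-step value in the dataset's style, shows how that differs from `ast.literal_eval`. Since the function joins the pieces back with `', '`, the resulting `body` text is largely unaffected, so this is a note on the intermediate lists rather than a required change.

```python
import ast

# Made-up example in the same shape as the 'steps' column
raw_steps = "['depending on size of squash , cut into half or fourths', 'remove seeds']"

# Approach used in recipe_file_body_lines: plain string manipulation
naive_steps = raw_steps.strip("[]").replace("'", "").split(', ')

# Alternative: parse the literal, which keeps each step intact
parsed_steps = ast.literal_eval(raw_steps)

print(len(naive_steps), naive_steps)
print(len(parsed_steps), parsed_steps)
```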

Now we load the data from `../input/RAW_recipes.csv`, choose the fields to send, and convert each row into Vespa's feed format:

In [40]:
df = pd.read_csv('../input/RAW_recipes.csv')

# print columns
print(df.columns)
df = df.dropna()
df = df.reset_index(drop=True)

df['body'] = df.apply(apply_recipe_file_body_lines, axis=1)
df['body_split'] = df['body'].str.split('\n')

df['minutes'] = "This recipe takes " + df['minutes'].astype(str) + " minutes to be done."
df['submitted'] = 'Recipe submitted on: ' + df["submitted"]
df['tags'] = df["tags"]
df['n_steps'] = 'Number of steps to make this recipe: ' + df['n_steps'].astype(str)
df['n_ingredients'] = 'Number of ingredients: ' + df['n_ingredients'].astype(str)
df['steps'] = df["steps"]
df['description'] = df["description"]
df['ingredients'] = df["ingredients"]
df['title'] = df['name']

namespace = "recipes"
document_type = "findmypasta"

def to_vespa_format(x):
    document_id = f"id:{namespace}:{document_type}::{x['id']}"
    return {
        "put": document_id,
        "fields": {
            "id": x["id"],
            "title": x["name"],
            #"tags": ast.literal_eval(x["tags"]),
            "steps": ast.literal_eval(x["steps"]),
            #"description": x["description"],
            "ingredients": ast.literal_eval(x["ingredients"]),
            #"minutes": x["minutes"],
            #"n_steps": x["n_steps"],
            #"n_ingredients": x["n_ingredients"],
            #"submitted": x["submitted"],
            #"body": x["body"],
            "body_split": x["body_split"]
        }
    }

vespa_feed = df.apply(to_vespa_format, axis=1).tolist()
vespa_feed_slice = vespa_feed[0:10000]
print(vespa_feed_slice[0])

Index(['name', 'id', 'minutes', 'contributor_id', 'submitted', 'tags',
       'nutrition', 'n_steps', 'steps', 'description', 'ingredients',
       'n_ingredients'],
      dtype='object')
{'put': 'id:recipes:findmypasta::137739', 'fields': {'id': 137739, 'title': 'arriba   baked winter squash mexican style', 'steps': ['make a choice and proceed with recipe', 'depending on size of squash , cut into half or fourths', 'remove seeds', 'for spicy squash , drizzle olive oil or melted butter over each cut squash piece', 'season with mexican seasoning mix ii', 'for sweet squash , drizzle melted honey , butter , grated piloncillo over each cut squash piece', 'season with sweet mexican spice mix', 'bake at 350 degrees , again depending on size , for 40 minutes up to an hour , until a fork can easily pierce the skin', 'be careful not to burn the squash especially if you opt to use sugar or butter', 'if you feel more comfortable , cover the squash with aluminum foil the first half hour , give or t

In [41]:
df

Unnamed: 0,name,id,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,n_ingredients,body,body_split,title
0,arriba baked winter squash mexican style,137739,This recipe takes 55 minutes to be done.,47892,Recipe submitted on: 2005-09-16,"['60-minutes-or-less', 'time-to-make', 'course...","[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",Number of steps to make this recipe: 11,"['make a choice and proceed with recipe', 'dep...",autumn is my favorite time of year to cook! th...,"['winter squash', 'mexican seasoning', 'mixed ...",Number of ingredients: 7,arriba baked winter squash mexican style\nRe...,"[arriba baked winter squash mexican style, R...",arriba baked winter squash mexican style
1,a bit different breakfast pizza,31490,This recipe takes 30 minutes to be done.,26278,Recipe submitted on: 2002-06-17,"['30-minutes-or-less', 'time-to-make', 'course...","[173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0]",Number of steps to make this recipe: 9,"['preheat oven to 425 degrees f', 'press dough...",this recipe calls for the crust to be prebaked...,"['prepared pizza crust', 'sausage patty', 'egg...",Number of ingredients: 6,a bit different breakfast pizza\nRecipe poste...,"[a bit different breakfast pizza, Recipe post...",a bit different breakfast pizza
2,all in the kitchen chili,112140,This recipe takes 130 minutes to be done.,196586,Recipe submitted on: 2005-02-25,"['time-to-make', 'course', 'preparation', 'mai...","[269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0]",Number of steps to make this recipe: 6,"['brown ground beef in large pot', 'add choppe...",this modified version of 'mom's' chili was a h...,"['ground beef', 'yellow onions', 'diced tomato...",Number of ingredients: 13,all in the kitchen chili\nRecipe posted on: 2...,"[all in the kitchen chili, Recipe posted on: ...",all in the kitchen chili
3,alouette potatoes,59389,This recipe takes 45 minutes to be done.,68585,Recipe submitted on: 2003-04-14,"['60-minutes-or-less', 'time-to-make', 'course...","[368.1, 17.0, 10.0, 2.0, 14.0, 8.0, 20.0]",Number of steps to make this recipe: 11,['place potatoes in a large pot of lightly sal...,"this is a super easy, great tasting, make ahea...","['spreadable cheese with garlic and herbs', 'n...",Number of ingredients: 11,alouette potatoes\nRecipe posted on: 2003-04-...,"[alouette potatoes, Recipe posted on: 2003-04...",alouette potatoes
4,amish tomato ketchup for canning,44061,This recipe takes 190 minutes to be done.,41706,Recipe submitted on: 2002-10-25,"['weeknight', 'time-to-make', 'course', 'main-...","[352.9, 1.0, 337.0, 23.0, 3.0, 0.0, 28.0]",Number of steps to make this recipe: 5,['mix all ingredients& boil for 2 1 / 2 hours ...,my dh's amish mother raised him on this recipe...,"['tomato juice', 'apple cider vinegar', 'sugar...",Number of ingredients: 8,amish tomato ketchup for canning\nRecipe pos...,"[amish tomato ketchup for canning, Recipe po...",amish tomato ketchup for canning
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
226652,zydeco soup,486161,This recipe takes 60 minutes to be done.,227978,Recipe submitted on: 2012-08-29,"['ham', '60-minutes-or-less', 'time-to-make', ...","[415.2, 26.0, 34.0, 26.0, 44.0, 21.0, 15.0]",Number of steps to make this recipe: 7,"['heat oil in a 4-quart dutch oven', 'add cele...",this is a delicious soup that i originally fou...,"['celery', 'onion', 'green sweet pepper', 'gar...",Number of ingredients: 22,zydeco soup\nRecipe posted on: 2012-08-29\nTag...,"[zydeco soup, Recipe posted on: 2012-08-29, Ta...",zydeco soup
226653,zydeco spice mix,493372,This recipe takes 5 minutes to be done.,1500678,Recipe submitted on: 2013-01-09,"['15-minutes-or-less', 'time-to-make', 'course...","[14.8, 0.0, 2.0, 58.0, 1.0, 0.0, 1.0]",Number of steps to make this recipe: 1,['mix all ingredients together thoroughly'],this spice mix will make your taste buds dance!,"['paprika', 'salt', 'garlic powder', 'onion po...",Number of ingredients: 13,zydeco spice mix\nRecipe posted on: 2013-01-09...,"[zydeco spice mix, Recipe posted on: 2013-01-0...",zydeco spice mix
226654,zydeco ya ya deviled eggs,308080,This recipe takes 40 minutes to be done.,37779,Recipe submitted on: 2008-06-07,"['60-minutes-or-less', 'time-to-make', 'course...","[59.2, 6.0, 2.0, 3.0, 6.0, 5.0, 0.0]",Number of steps to make this recipe: 7,"['in a bowl , combine the mashed yolks and may...","deviled eggs, cajun-style","['hard-cooked eggs', 'mayonnaise', 'dijon must...",Number of ingredients: 8,zydeco ya ya deviled eggs\nRecipe posted on: 2...,"[zydeco ya ya deviled eggs, Recipe posted on: ...",zydeco ya ya deviled eggs
226655,cookies by design cookies on a stick,298512,This recipe takes 29 minutes to be done.,506822,Recipe submitted on: 2008-04-15,"['30-minutes-or-less', 'time-to-make', 'course...","[188.0, 11.0, 57.0, 11.0, 7.0, 21.0, 9.0]",Number of steps to make this recipe: 9,['place melted butter in a large mixing bowl a...,"i've heard of the 'cookies by design' company,...","['butter', 'eagle brand condensed milk', 'ligh...",Number of ingredients: 10,cookies by design cookies on a stick\nRecipe...,"[cookies by design cookies on a stick, Recip...",cookies by design cookies on a stick


Write the formatted documents to a JSON Lines file:

In [42]:
with open("vespa_feed2.jsonl", "w") as f:
    for item in vespa_feed_slice:
        f.write(json.dumps(item) + "\n")
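
Before feeding, it can help to confirm that every line of the file is a well-formed put operation. The sketch below is one possible sanity check, assuming each line must be a JSON object with a `put` document ID starting with `id:` and a `fields` object, as produced by `to_vespa_format`. The function name `validate_feed_lines` and the sample lines are hypothetical.

```python
import json


def validate_feed_lines(lines):
    """Count valid put operations; raise ValueError on the first malformed line."""
    count = 0
    for number, line in enumerate(lines, start=1):
        operation = json.loads(line)
        if "put" not in operation or "fields" not in operation:
            raise ValueError(f"line {number}: missing 'put' or 'fields'")
        if not str(operation["put"]).startswith("id:"):
            raise ValueError(f"line {number}: unexpected document id {operation['put']!r}")
        count += 1
    return count


# Two sample lines in the same shape as the feed file
sample_lines = [
    json.dumps({"put": "id:recipes:findmypasta::1", "fields": {"id": 1, "title": "a"}}),
    json.dumps({"put": "id:recipes:findmypasta::2", "fields": {"id": 2, "title": "b"}}),
]
valid_count = validate_feed_lines(sample_lines)
```

With the real file, the same function could be called on an open file handle, for example `validate_feed_lines(open("vespa_feed2.jsonl"))`.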

Feed the documents to Vespa:

In [43]:
! vespa config set target local
! vespa feed vespa_feed2.jsonl

{
  "feeder.operation.count": 100,
  "feeder.seconds": 157.300,
  "feeder.ok.count": 100,
  "feeder.ok.rate": 0.636,
  "feeder.error.count": 0,
  "feeder.inflight.count": 0,
  "http.request.count": 100,
  "http.request.bytes": 81098,
  "http.request.MBps": 0.001,
  "http.exception.count": 0,
  "http.response.count": 100,
  "http.response.bytes": 9422,
  "http.response.MBps": 0.000,
  "http.response.error.count": 0,
  "http.response.latency.millis.min": 2131,
  "http.response.latency.millis.avg": 25363,
  "http.response.latency.millis.max": 46928,
  "http.response.code.counts": {
    "200": 100
  }
}


In [44]:
documents = app.query(yql = "select * from sources * where true")
documents.number_documents_indexed

100

## Queries

Let's load the queries to run from `../input/Recipe_Search_Questions.xlsx`:

In [45]:
# load Recipe_Search_Questions.xlsx; each row's 'Query' is answered below
import pandas as pd
questions = pd.read_excel('../input/Recipe_Search_Questions.xlsx')
questions.head()

Unnamed: 0,Tipo,Descrição,Query
0,Keywords,Pergunta simples,grilled cheese sandwich recipe
1,Keywords,Pergunta simples,mango smoothie
2,Semantica,Pergunta média,gluten-free bread without yeast
3,Semantica,Pergunta média,low carb dessert for diabetics
4,Semantica,Pergunta difícil,traditional Japanese breakfast for a family


We generate the output file with the answers for each *query*:

In [46]:
from vespa.io import VespaQueryResponse
import json

# Assumes 'questions' is a DataFrame with columns ['Query', 'Tipo', 'Descrição']
data = pd.DataFrame(columns=['id', 'title', 'Query', 'Tipo', 'Descrição'])

model_to_ranking_dict = {
    "colbert_max_body_split": "colbert_max_body_split",
    "colbert_avg_body_split": "colbert_avg_body_split",
    "colbert_max_steps": "colbert_max_steps",
    "colbert_avg_steps": "colbert_avg_steps",
    "colbert_max_ingredients": "colbert_max_ingredients",
    "colbert_avg_ingredients": "colbert_avg_ingredients",
    "colbert_max_body_split_bm25": "colbert_max_body_split_bm25",
    "colbert_avg_body_split_bm25": "colbert_avg_body_split_bm25",
    "colbert_max_steps_bm25": "colbert_max_steps_bm25",
    "colbert_avg_steps_bm25": "colbert_avg_steps_bm25",
    "colbert_max_ingredients_bm25": "colbert_max_ingredients_bm25",
    "colbert_avg_ingredients_bm25": "colbert_avg_ingredients_bm25",
    "colbert_cross_body_split": "colbert_cross_body_split"
}

lista_body_split = ['colbert_max_body_split', 'colbert_avg_body_split', 'colbert_max_body_split_bm25', 'colbert_avg_body_split_bm25']

embeddings = {
    "colbert_max_body_split": "embedding_body_split",
    "colbert_avg_body_split": "embedding_body_split",
    "colbert_max_steps": "embedding_steps",
    "colbert_avg_steps": "embedding_steps",
    "colbert_max_ingredients": "embedding_ingredients",
    "colbert_avg_ingredients": "embedding_ingredients",
    "colbert_max_body_split_bm25": "embedding_body_split",
    "colbert_avg_body_split_bm25": "embedding_body_split",
    "colbert_max_steps_bm25": "embedding_steps",
    "colbert_avg_steps_bm25": "embedding_steps",
    "colbert_max_ingredients_bm25": "embedding_ingredients",
    "colbert_avg_ingredients_bm25": "embedding_ingredients",
    "colbert_cross_body_split": "embedding_body_split"
}

# one empty dict per body_split model, later filled as {query: [context scores of each hit]}
relevance = {model: {} for model in lista_body_split}

for selected_model in model_to_ranking_dict.keys():
    output_name = 'output/Results_' + selected_model + '_extraQuestions' + '.xlsx'
    embedding = embeddings[selected_model]
    # start from an empty frame so each output file only holds this model's hits
    data = pd.DataFrame(columns=['id', 'title', 'Query', 'Tipo', 'Descrição'])

    if model_to_ranking_dict[selected_model] is not None:
        i = 0
        for input_query in questions['Query']:
            # save a checkpoint every 100 queries
            if i % 100 == 0:
                data.to_excel(output_name, index=False)

            with app.syncio(connections=1) as session:
                try:
                    response: VespaQueryResponse = session.query(
                        yql="select * from sources * where rank({targetHits:1000}nearestNeighbor(" +embedding+",q), userQuery()) limit 5",
                        #yql="select * from sources * where ({targetHits:1000}nearestNeighbor(" + embedding + ",q)) limit 5",
                        query=input_query,
                        ranking=model_to_ranking_dict[selected_model],
                        body={
                            "input.query(q)": f"embed(e5, \"{input_query}\")",
                            "input.query(qt)": f"embed(colbert, \"{input_query}\")",
                            # "input.query(q)": f"embed({input_query})",
                            # "timeout": "30s"  # raise the query timeout to 30 seconds
                        },
                        hits = 5
                    )
                    assert response.is_successful()
                except Exception as e:
                    print(f"Error with query '{input_query}': {e}")
                    # still count the failed query so the checkpoint counter keeps advancing
                    i += 1
                    continue

                for hit in response.hits:
                    record = {}
                    for field in ['id', 'title']:
                        record[field] = hit['fields'].get(field, None)
                    record["Query"] = input_query
                    record["Tipo"] = questions[questions['Query'] == input_query]['Tipo'].values[0]
                    record["Descrição"] = questions[questions['Query'] == input_query]['Descrição'].values[0]
                    data = pd.concat([data, pd.DataFrame([record])], ignore_index=True)

                    if model_to_ranking_dict[selected_model] in lista_body_split:
                        cells = hit['fields']['matchfeatures']['max_sim_per_context_body_split']['cells']
                        # turn the dict into a list of values
                        values = list(cells.values())
                        # create the list for this query the first time it is seen
                        if input_query not in relevance[model_to_ranking_dict[selected_model]]:
                            relevance[model_to_ranking_dict[selected_model]][input_query] = []
                        relevance[model_to_ranking_dict[selected_model]][input_query].append(values)

            i += 1

        # Sorting
        data = data.sort_values(by=['Tipo', 'Query'])

        # reordering columns
        data = data[['Tipo', 'Descrição', 'Query', 'id', 'title']]

        # exporting to excel
        data.to_excel(output_name, index=False)
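
The loop above builds the YQL string and the `embed(...)` inputs inline for every question. As a minimal sketch, the same request could be assembled by a small pure function that is easy to inspect before sending anything to Vespa. The name `build_query_kwargs` and its parameters are hypothetical (not part of pyvespa); the function only returns the keyword arguments that the cell above passes to `session.query`. It also rejects questions containing a double quote, on the assumption that such a question would yield a malformed `embed(...)` expression.

```python
def build_query_kwargs(input_query, embedding_field, rank_profile, hits=5, target_hits=1000):
    """Hypothetical helper: assemble the arguments used for one hybrid query."""
    # A double quote inside the question would close the embed() string early (assumption),
    # so fail loudly instead of sending a malformed expression.
    if '"' in input_query:
        raise ValueError(f"query contains a double quote: {input_query!r}")

    # Same YQL as in the loop: nearest-neighbor retrieval ranked together with userQuery()
    yql = (
        "select * from sources * where "
        f"rank({{targetHits:{target_hits}}}nearestNeighbor({embedding_field},q), userQuery()) "
        f"limit {hits}"
    )
    return {
        "yql": yql,
        "query": input_query,
        "ranking": rank_profile,
        "hits": hits,
        "body": {
            "input.query(q)": f'embed(e5, "{input_query}")',
            "input.query(qt)": f'embed(colbert, "{input_query}")',
        },
    }


kwargs = build_query_kwargs("mango smoothie", "embedding_body_split", "colbert_max_body_split")
print(kwargs["yql"])
```

Inside the loop, the call would then read `session.query(**build_query_kwargs(input_query, embedding, model_to_ranking_dict[selected_model]))`.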

        
            


In [47]:
relevance

{'colbert_max_body_split': {'grilled cheese sandwich recipe': [[79.8575668334961,
    37.86740493774414,
    39.07607650756836,
    48.190711975097656,
    46.08265686035156,
    44.72810745239258,
    47.21826171875,
    46.92736053466797,
    51.74022674560547],
   [33.92506790161133,
    37.07310485839844,
    49.689884185791016,
    16.444217681884766,
    46.84369659423828,
    44.72810745239258,
    69.46224975585938,
    47.12784957885742,
    79.0047607421875],
   [42.5723876953125,
    38.151344299316406,
    38.420379638671875,
    37.0662727355957,
    46.58928298950195,
    44.72810745239258,
    63.02363967895508,
    47.119239807128906,
    75.8159408569336],
   [38.47965621948242,
    42.308929443359375,
    31.994760513305664,
    24.36667251586914,
    46.576499938964844,
    44.72810745239258,
    37.185707092285156,
    47.268096923828125,
    75.17180633544922],
   [74.82846069335938,
    37.760353088378906,
    31.730398178100586,
    33.6027946472168,
    47.65190

Results for the query `chocolate`:

In [48]:
from vespa.io import VespaQueryResponse

with app.syncio(connections=1) as session:
    response:VespaQueryResponse = session.query(
        yql="select * from sources * where ({targetHits:1000}nearestNeighbor(embedding_body_split,q))",
        ranking="colbert_max_body_split",
        query="chocolate", 
        body={
            "input.query(q)": 'embed(e5, "chocolate")',
            "input.query(qt)": 'embed(colbert, "chocolate")',
        },
    )

assert(response.is_successful())
for hit in response.hits:
    record = {}
    for field in ['id', 'title', 'body_split']:
        record[field] = hit['fields'][field]
    print(record)

{'id': 58651, 'title': 'turtle  squares', 'body_split': ['turtle  squares', 'Recipe posted on: 2003-04-07', 'Tags: 30-minutes-or-less, time-to-make, course, main-ingredient, cuisine, preparation, occasion, north-american, for-large-groups, desserts, fruit, oven, easy, finger-food, kid-friendly, cookies-and-brownies, chocolate, bar-cookies, nuts, dietary, low-sodium, low-in-something, taste-mood, sweet, equipment, number-of-servings, presentation', 'Description: for lovers of pecans and chocolate...', 'This recipe takes 30 minutes to be done.', 'For this recipe you will need the ingredients: ', 'flour, brown sugar, butter, pecans, semi-sweet chocolate chips', 'The 15 steps to make this recipe are: ', 'preheat oven to 350 degrees f, spray a 13 x 9 baking pan evenly with non-stick cooking spray, beat 1 cup brown sugar with 1 / 2 cup melted butter with an electric mixer on medium for 2-3 minutes, add the flour mixture and mix until smooth, press the flour mixture evenly and firmly into the

We can inspect the total and mean relevance of each context (each chunk of `body_split`), where the context index maps to the following piece of the recipe text:

- `0`: `recipe_body = recipe['name'] + '\n'`;
- `1`: `"Recipe posted on: " + str(recipe['submitted']) + '\n'`;
- `2`: `"Tags: " + ', '.join(recipe['tags']) + '\n'`;
- `3`: `"Description: " + recipe['description'] + '\n'`;
- `4`: `"This recipe takes " + str(recipe['minutes']) + " minutes to be done." + '\n'`;
- `5`: `"For this recipe you will need the ingredients: " + '\n'`;
- `6`: `', '.join(recipe['ingredients']) + '\n'`;
- `7`: `"The " + str(recipe["n_steps"]) + " steps to make this recipe are: " + '\n'`;
- `8`: `', '.join(recipe['steps'])`.
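
To make the mapping in the list above concrete, here is a minimal sketch of how the nine chunks could be produced from one recipe. It assumes that the recipe text is the concatenation of the pieces listed and that `body_split` is obtained by splitting that text on newlines; the function name `build_body_split` and the sample recipe are made up for illustration.

```python
def build_body_split(recipe):
    """Hypothetical helper: rebuild the recipe text piece by piece and split it into chunks."""
    recipe_body = recipe['name'] + '\n'
    recipe_body += "Recipe posted on: " + str(recipe['submitted']) + '\n'
    recipe_body += "Tags: " + ', '.join(recipe['tags']) + '\n'
    recipe_body += "Description: " + recipe['description'] + '\n'
    recipe_body += "This recipe takes " + str(recipe['minutes']) + " minutes to be done." + '\n'
    recipe_body += "For this recipe you will need the ingredients: " + '\n'
    recipe_body += ', '.join(recipe['ingredients']) + '\n'
    recipe_body += "The " + str(recipe['n_steps']) + " steps to make this recipe are: " + '\n'
    recipe_body += ', '.join(recipe['steps'])
    # Assumption: one chunk per line, so chunk i corresponds to index i in the list above
    return recipe_body.split('\n')


# Made-up recipe, only used to show the chunk layout
sample_recipe = {
    'name': 'toy brownies',
    'submitted': '2020-01-01',
    'tags': ['desserts', 'chocolate'],
    'description': 'a tiny example',
    'minutes': 30,
    'ingredients': ['flour', 'sugar'],
    'n_steps': 2,
    'steps': ['mix everything', 'bake'],
}

chunks = build_body_split(sample_recipe)
for index, chunk in enumerate(chunks):
    print(index, repr(chunk))
```

Under this assumption, a field that itself contains a newline (for example a multi-line description) would yield more than nine chunks, and the indices after it would shift.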

In [49]:
#response.hits[0]
# get the cells of the first hit
#print(response.hits[0]['fields']['matchfeatures']['max_sim_per_context']['cells'])

total = {'0': 0, '1': 0, '2': 0, '3': 0, '4': 0, '5': 0, '6': 0, '7': 0, '8': 0}

for hit in response.hits:
    cells = hit['fields']['matchfeatures']['max_sim_per_context_body_split']['cells']
    for key in total.keys():
        total[key] += cells[key]
    print(hit['fields']['id'])

# mean score per context over the returned hits
mean = {k: v / len(response.hits) for k, v in total.items()}

print(total)
print(mean)

58651
32271
27087
62368
35964
71635
44895
39363
23933
107699
{'0': 420.0016670227051, '1': 274.50904846191406, '2': 308.63941764831543, '3': 416.2218737602234, '4': 181.57076835632324, '5': 84.61458206176758, '6': 570.7690467834473, '7': 135.15281772613525, '8': 338.6511507034302}
{'0': 42.000166702270505, '1': 27.450904846191406, '2': 30.863941764831544, '3': 41.62218737602234, '4': 18.157076835632324, '5': 8.461458206176758, '6': 57.076904678344725, '7': 13.515281772613525, '8': 33.865115070343016}
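
The numbers above come from the `max_sim_per_context_body_split` match feature, whose rank-profile expression is defined earlier in the notebook. As a rough illustration only, the sketch below shows what a ColBERT-style MaxSim score per context usually computes: for each query token, take the best dot product against the tokens of one chunk, then sum over the query tokens. The function `max_sim_per_context` and the toy vectors are assumptions made for this example and may differ from the actual Vespa expression.

```python
import numpy as np


def max_sim_per_context(query_tokens, context_tokens):
    """query_tokens: (n_query_tokens, dim) array.
    context_tokens: list of (n_tokens, dim) arrays, one per chunk.
    Returns {context index as str: MaxSim score}, mirroring the 'cells' layout."""
    scores = {}
    for index, context in enumerate(context_tokens):
        sims = query_tokens @ context.T  # dot product of every query token with every chunk token
        # best-matching chunk token per query token, summed over the query tokens
        scores[str(index)] = float(sims.max(axis=1).sum())
    return scores


# Toy 2-dimensional "token embeddings"
query = np.array([[1.0, 0.0], [0.0, 1.0]])
contexts = [
    np.array([[1.0, 0.0], [0.5, 0.5]]),  # chunk 0 with two tokens
    np.array([[0.0, 2.0]]),              # chunk 1 with one token
]

result = max_sim_per_context(query, contexts)
print(result)  # {'0': 1.5, '1': 2.0}
```

Read this way, a high mean for context `6` (ingredients) for the query `chocolate` means that the query tokens found their closest matches among the ingredient tokens.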


In [51]:
# transform relevance into a DataFrame
colbert_max_body_split_relevance = pd.DataFrame(relevance['colbert_max_body_split'])
# flip columns and rows
#colbert_max_body_split_relevance = colbert_max_body_split_relevance.T
# add column with name of the model
#colbert_max_body_split_relevance['Model'] = 'colbert_max_body_split'
# reset index
#colbert_max_body_split_relevance = colbert_max_body_split_relevance.reset_index()
# change name of the columns
#colbert_max_body_split_relevance.columns = ['Query', 'Name', 'Submitted', 'Tags', 'Description', 'Minutes', 'Filler', 'Ingredients', 'N_steps', 'Steps', 'Model']
# change order of columns
#colbert_max_body_split_relevance = colbert_max_body_split_relevance[['Model', 'Query', 'Name', 'Submitted', 'Tags', 'Description', 'Minutes', 'Filler', 'Ingredients', 'N_steps', 'Steps']]
colbert_max_body_split_relevance


Unnamed: 0,grilled cheese sandwich recipe,mango smoothie,gluten-free bread without yeast,low carb dessert for diabetics,traditional Japanese breakfast for a family,What kind of soup can I make with butternut squash and coconut milk?,recipe for chicken curry,how to make iced tea,vegan options for a Thanksgiving dinner,What can I cook with quinoa and kale for a nutritious meal?,...,apple pie,Brûlée Cream,how to make a pizza without an oven,pancake without flour and milk,healthy recipe for quick lunch,what can I make for a romantic dinner,I'm vegan. How can I make a bolognese?,"I have 9min maximum to make a lunch, can you help me?",Esfiha de carne vegana,Chocolate pizza
0,"[79.8575668334961, 37.86740493774414, 39.07607...","[82.4940414428711, 20.301599502563477, 56.9345...","[15.141139030456543, 14.111498832702637, 56.53...","[28.021432876586914, 43.32831573486328, 90.881...","[10.285633087158203, 9.859478950500488, 3.6066...","[30.073957443237305, 28.512340545654297, 23.51...","[73.51380157470703, 37.4853630065918, 20.70315...","[10.091914176940918, 14.661450386047363, 18.16...","[37.19544219970703, 37.15163040161133, 83.8871...","[25.6418399810791, 14.753334045410156, 25.8547...",...,"[60.50924301147461, 24.188045501708984, 26.545...","[69.75659942626953, 21.072011947631836, 37.562...","[60.926517486572266, 22.118410110473633, 54.85...","[51.03062057495117, 27.452587127685547, 38.786...","[30.058250427246094, 33.23168182373047, 68.285...","[52.8565673828125, 27.33098030090332, 65.65467...","[73.68089294433594, 21.6163387298584, 28.48501...","[23.195350646972656, 17.771636962890625, 70.44...","[55.21561050415039, 16.931758880615234, 24.701...","[89.65353393554688, 34.90876770019531, 27.2881..."
1,"[33.92506790161133, 37.07310485839844, 49.6898...","[57.454002380371094, 16.475730895996094, 3.251...","[55.848236083984375, 7.313124656677246, 18.303...","[42.3276252746582, 43.745643615722656, 87.3068...","[13.891664505004883, 12.112494468688965, 5.066...","[68.50535583496094, 21.271812438964844, 46.143...","[76.40644836425781, 39.495269775390625, 35.091...","[7.279230117797852, 9.37393569946289, 5.069894...","[44.58099365234375, 38.707008361816406, 78.032...","[17.288116455078125, 15.614280700683594, 22.77...",...,"[83.87152862548828, 27.054054260253906, 19.499...","[26.977062225341797, 21.850948333740234, 1.776...","[27.701200485229492, 32.391998291015625, 31.79...","[82.6087417602539, 25.992237091064453, 46.6906...","[37.93263244628906, 29.17930793762207, 35.2290...","[20.326974868774414, 31.498088836669922, 88.25...","[21.91469955444336, 20.493850708007812, 35.126...","[47.771568298339844, 18.64077377319336, 64.966...","[49.159786224365234, 17.401531219482422, 34.01...","[25.104711532592773, 40.35238265991211, 54.058..."
2,"[42.5723876953125, 38.151344299316406, 38.4203...","[41.26013946533203, 15.325758934020996, 16.596...","[27.818124771118164, 12.783323287963867, 20.42...","[25.255207061767578, 43.26431655883789, 80.291...","[35.86355209350586, 12.963134765625, 25.472202...","[51.35893630981445, 22.701366424560547, 40.276...","[75.3857650756836, 45.491058349609375, 48.5688...","[26.70091438293457, 11.65317440032959, 28.5319...","[35.2246208190918, 40.6373291015625, 74.461120...","[19.86221694946289, 16.164268493652344, 21.492...",...,"[78.81370544433594, 22.227397918701172, 46.278...","[62.74779510498047, 17.62429428100586, 34.5983...","[68.9728775024414, 22.766002655029297, 31.1071...","[36.200836181640625, 22.136402130126953, 42.66...","[18.817188262939453, 26.66770362854004, 55.104...","[70.04489135742188, 28.48811912536621, 73.1636...","[47.58837127685547, 19.280973434448242, 41.875...","[45.66268539428711, 19.100440979003906, 62.509...","[47.307010650634766, 19.895362854003906, 14.55...","[80.51195526123047, 34.60174560546875, 60.3612..."
3,"[38.47965621948242, 42.308929443359375, 31.994...","[29.575815200805664, 17.542787551879883, 28.67...","[48.913780212402344, 10.264599800109863, 34.17...","[45.32938003540039, 40.0904541015625, 78.54539...","[35.263282775878906, 13.648066520690918, 24.21...","[64.0052719116211, 22.83452606201172, 50.51630...","[67.38259887695312, 39.630767822265625, 46.625...","[-5.601774215698242, 11.81130313873291, 7.9657...","[64.3638687133789, 42.65726852416992, 73.07497...","[7.98209285736084, 18.901386260986328, 22.5421...",...,"[33.85438537597656, 29.99755096435547, 25.0147...","[58.52688217163086, 21.62645149230957, 37.9750...","[28.565176010131836, 21.78206443786621, 28.591...","[43.30500030517578, 31.214929580688477, 48.262...","[31.70859146118164, 31.073530197143555, 62.840...","[33.181053161621094, 35.01700210571289, 38.285...","[29.26022720336914, 20.98968505859375, 35.2217...","[34.339847564697266, 24.056838989257812, 61.80...","[43.89531326293945, 16.227956771850586, 36.230...","[45.45659255981445, 31.301179885864258, 56.661..."
4,"[74.82846069335938, 37.760353088378906, 31.730...","[47.43881607055664, 24.29666519165039, 10.4422...","[48.71798324584961, 18.6467342376709, 27.13154...","[42.27973937988281, 40.57823944091797, 76.7731...","[34.96421432495117, 12.925195693969727, 12.966...","[63.21371841430664, 22.666770935058594, 51.938...","[60.46963119506836, 40.5334587097168, 49.99179...","[17.527250289916992, 0.5293371081352234, 16.78...","[43.442928314208984, 37.673065185546875, 73.94...","[26.745018005371094, 17.55309295654297, 25.035...",...,"[31.519832611083984, 23.98388671875, 60.331935...","[22.135805130004883, 25.026803970336914, 10.00...","[11.857139587402344, 27.038618087768555, 35.18...","[53.46022415161133, 28.5036678314209, 35.18632...","[46.49198913574219, 28.37847900390625, 58.0714...","[24.96954917907715, 32.70132827758789, 46.9972...","[29.104001998901367, 18.213245391845703, 23.18...","[27.757898330688477, 19.092958450317383, 37.77...","[43.54084396362305, 15.613653182983398, 23.812...","[59.03912353515625, 34.06766128540039, 41.0628..."


In [55]:
relevance['colbert_max_body_split']

{'grilled cheese sandwich recipe': [[79.8575668334961,
   37.86740493774414,
   39.07607650756836,
   48.190711975097656,
   46.08265686035156,
   44.72810745239258,
   47.21826171875,
   46.92736053466797,
   51.74022674560547],
  [33.92506790161133,
   37.07310485839844,
   49.689884185791016,
   16.444217681884766,
   46.84369659423828,
   44.72810745239258,
   69.46224975585938,
   47.12784957885742,
   79.0047607421875],
  [42.5723876953125,
   38.151344299316406,
   38.420379638671875,
   37.0662727355957,
   46.58928298950195,
   44.72810745239258,
   63.02363967895508,
   47.119239807128906,
   75.8159408569336],
  [38.47965621948242,
   42.308929443359375,
   31.994760513305664,
   24.36667251586914,
   46.576499938964844,
   44.72810745239258,
   37.185707092285156,
   47.268096923828125,
   75.17180633544922],
  [74.82846069335938,
   37.760353088378906,
   31.730398178100586,
   33.6027946472168,
   47.65190505981445,
   44.72810745239258,
   36.867557525634766,
   48.27066

In [89]:
import pandas as pd

# one row per (model, query, hit); the nine score columns follow the body_split chunk order
columns = ['Model', 'Query', 'Name', 'Submitted', 'Tags', 'Description', 'Minutes', 'Filler', 'Ingredients', 'N_steps', 'Steps']


def relevance_dataframe(relevance, model_name):
    # keep the accumulators local: module-level lists would leak rows between calls
    all_rows = []
    wrong_size = []
    for query in relevance[model_name].keys():
        for relevance_list in relevance[model_name][query]:
            # build the row: model name, query, then one score per chunk
            row = [model_name, query] + relevance_list
            # set aside hits whose number of chunk scores does not match the columns
            if len(row) != len(columns):
                wrong_size.append(row)
                continue
            all_rows.append(row)

    # create a DataFrame from the well-formed rows
    return pd.DataFrame(all_rows, columns=columns), wrong_size


colbert_max_body_split_relevance, wrong_size1 = relevance_dataframe(relevance, 'colbert_max_body_split')
colbert_avg_body_split_relevance, wrong_size2 = relevance_dataframe(relevance, 'colbert_avg_body_split')
colbert_max_body_split_bm25_relevance, wrong_size3 = relevance_dataframe(relevance, 'colbert_max_body_split_bm25')
colbert_avg_body_split_bm25_relevance, wrong_size4 = relevance_dataframe(relevance, 'colbert_avg_body_split_bm25')
colbert_relevance = pd.concat([colbert_max_body_split_relevance, colbert_avg_body_split_relevance, colbert_max_body_split_bm25_relevance, colbert_avg_body_split_bm25_relevance], ignore_index=True)
wrong_size = wrong_size1 + wrong_size2 + wrong_size3 + wrong_size4

# write the DataFrame to an Excel file
colbert_relevance.to_excel('output/colbert_relevance.xlsx', index=False)

In [86]:
print(wrong_size)
print(len(wrong_size))

[['colbert_max_body_split', 'What kind of soup can I make with butternut squash and coconut milk?', 51.35893630981445, 22.701366424560547, 40.27640914916992, 23.268001556396484, 31.617496490478516, 35.17253875732422, 27.517311096191406, 35.7313117980957, 66.65198516845703, 35.550331115722656, 63.48490905761719], ['colbert_max_body_split', 'spaghetti carbonara recipe', 46.71865463256836, 42.880924224853516, 39.440773010253906, 33.942081451416016, 30.9733829498291, 42.39466094970703, 56.46176528930664, 45.17765426635742, 38.933265686035156, 49.51409912109375, 57.27763748168945], ['colbert_max_body_split', 'what can I make for a romantic dinner', 33.181053161621094, 35.01700210571289, 38.285953521728516, 51.374942779541016, 67.42009735107422, 47.937355041503906, 43.95292663574219, 52.82949447631836, 32.2237548828125, 54.9410400390625, 48.95098876953125], ['colbert_avg_body_split', 'how to make hummus', 29.40514373779297, 15.404664993286133, 18.8957462310791, 7.568050384521484, 13.19954776
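
The rows printed above carry eleven scores instead of nine, which suggests (an assumption, not verified here) that some recipes were split into more chunks than the nine named columns allow. Rather than discarding those hits, one option is a long-format table with one row per (model, query, hit, context), which accepts any number of chunks. The function `relevance_long` and the toy dictionary below are hypothetical and only illustrate the idea.

```python
import pandas as pd


def relevance_long(relevance, model_name):
    """Hypothetical helper: flatten relevance[model][query] into one row per context score."""
    rows = []
    for query, hits in relevance[model_name].items():
        for hit_rank, scores in enumerate(hits):
            for context_index, score in enumerate(scores):
                rows.append({
                    'Model': model_name,
                    'Query': query,
                    'Hit': hit_rank,           # position of the hit in the result list
                    'Context': context_index,  # position of the chunk within the hit
                    'Score': score,
                })
    return pd.DataFrame(rows, columns=['Model', 'Query', 'Hit', 'Context', 'Score'])


# Toy data with hits of different lengths, mimicking the structure of `relevance`
toy_relevance = {'colbert_max_body_split': {'chocolate': [[1.0, 2.0, 3.0], [4.0, 5.0]]}}

long_df = relevance_long(toy_relevance, 'colbert_max_body_split')
mean_per_context = long_df.groupby('Context')['Score'].mean()
print(mean_per_context)
```

With this layout the per-context averages still work when hits have different numbers of chunks, although the chunk index no longer maps to a fixed field name for the irregular hits.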

In [87]:
from vespa.io import VespaQueryResponse

with app.syncio(connections=1) as session:
    response:VespaQueryResponse = session.query(
        yql="select * from sources * where ({targetHits:1000}nearestNeighbor(embedding_body_split,q))",
        ranking="colbert_cross_body_split",
        query="chocolate", 
        body={
            "input.query(q)": 'embed(e5, "chocolate")',
            "input.query(qt)": 'embed(colbert, "chocolate")',
        },
    )

assert(response.is_successful())
for hit in response.hits:
    record = {}
    for field in ['id', 'title', 'body_split']:
        record[field] = hit['fields'][field]
    print(record)

{'id': 58651, 'title': 'turtle  squares', 'body_split': ['turtle  squares', 'Recipe posted on: 2003-04-07', 'Tags: 30-minutes-or-less, time-to-make, course, main-ingredient, cuisine, preparation, occasion, north-american, for-large-groups, desserts, fruit, oven, easy, finger-food, kid-friendly, cookies-and-brownies, chocolate, bar-cookies, nuts, dietary, low-sodium, low-in-something, taste-mood, sweet, equipment, number-of-servings, presentation', 'Description: for lovers of pecans and chocolate...', 'This recipe takes 30 minutes to be done.', 'For this recipe you will need the ingredients: ', 'flour, brown sugar, butter, pecans, semi-sweet chocolate chips', 'The 15 steps to make this recipe are: ', 'preheat oven to 350 degrees f, spray a 13 x 9 baking pan evenly with non-stick cooking spray, beat 1 cup brown sugar with 1 / 2 cup melted butter with an electric mixer on medium for 2-3 minutes, add the flour mixture and mix until smooth, press the flour mixture evenly and firmly into the

In [88]:
for hit in response.hits:
    cells = hit['fields']['matchfeatures']
    print(cells)

{'cos_sim_body_split': 0.6401920407274196, 'cross_max_sim_body_split': 97.20193457603455, 'max_cross_sim_body_split': 97.20193457603455}
{'cos_sim_body_split': 0.627614830999481, 'cross_max_sim_body_split': 86.64896368980408, 'max_cross_sim_body_split': 86.64896368980408}
{'cos_sim_body_split': 0.6174435864411332, 'cross_max_sim_body_split': 79.09302496910095, 'max_cross_sim_body_split': 79.09302496910095}
{'cos_sim_body_split': 0.6242752294233741, 'cross_max_sim_body_split': 76.00577592849731, 'max_cross_sim_body_split': 76.00577592849731}
{'cos_sim_body_split': 0.6043439211356324, 'cross_max_sim_body_split': 75.3615174293518, 'max_cross_sim_body_split': 75.3615174293518}
{'cos_sim_body_split': 0.6073898355517429, 'cross_max_sim_body_split': 69.2407785654068, 'max_cross_sim_body_split': 69.2407785654068}
{'cos_sim_body_split': 0.6128411898407874, 'cross_max_sim_body_split': 65.65912812948227, 'max_cross_sim_body_split': 65.65912812948227}
{'cos_sim_body_split': 0.6186847566791794, 'cr