# Long-Context ColBERT

Let's create a Vespa application with Long-Context ColBERT, following [this example](https://pyvespa.readthedocs.io/en/latest/examples/chat_with_your_pdfs_using_colbert_langchain_and_Vespa-cloud.html).

In [1]:
# verify if the gpu is available
import torch

print(torch.cuda.is_available())

torch.device('cuda' if torch.cuda.is_available() else 'cpu')

True


device(type='cuda')

## Imports

First, let's import the libraries needed to build Vespa application packages.

In [2]:
from vespa.package import (
    ApplicationPackage,
    Component,
    Parameter,
    Field,
    HNSW,
    RankProfile,
    Function,
    FirstPhaseRanking,
    SecondPhaseRanking,
    FieldSet,
    DocumentSummary,
    Summary,
)
from pathlib import Path
import json
import pandas as pd
import ast
import numpy as np

from vespa.package import Schema, Document

Let's check that the Vespa CLI is installed:

In [3]:
!vespa --version

Usage:
  vespa [flags]
  vespa [command]

Available Commands:
  activate    Activate (deploy) a previously prepared application package
  auth        Manage Vespa Cloud credentials
  clone       Create files and directory structure from a Vespa sample application
  completion  Generate the autocompletion script for the specified shell
  config      Manage persistent values for global flags
  curl        Access Vespa directly using curl
  deploy      Deploy (prepare and activate) an application package
  destroy     Remove a deployed Vespa application and its data
  document    Issue a single document operation to Vespa
  feed        Feed multiple document operations to Vespa
  fetch       Download a deployed application package
  help        Help about any command
  log         Show the Vespa log
  prepare     Prepare an application package for activation
  prod        Deploy an application package to production in Vespa Cloud
  query       Issue a query to Vespa
  status      Show Ves

## Creating the Vespa application

Let's create the Vespa application package with the `e5` and `colbert` embedder components:

In [4]:
from vespa.package import ApplicationPackage, Component, Parameter

vespa_app_name = "findmypasta"
app_package = ApplicationPackage(
    name=vespa_app_name,
    components=[
        Component(
            id="e5",
            type="hugging-face-embedder",
            parameters=[
                Parameter(
                    name="transformer-model",
                    args={
                        "url": "https://huggingface.co/intfloat/e5-small-v2/resolve/main/model.onnx"
                    },
                ),
                Parameter(
                    name="tokenizer-model",
                    args={
                        "url": "https://huggingface.co/intfloat/e5-small-v2/raw/main/tokenizer.json"
                    },
                ),
            ],
        ),
        Component(
            id="colbert",
            type="colbert-embedder",
            parameters=[
                Parameter(
                    name="transformer-model",
                    args={
                        "url": "https://huggingface.co/colbert-ir/colbertv2.0/resolve/main/model.onnx"
                    },
                ),
                Parameter(
                    name="tokenizer-model",
                    args={
                        "url": "https://huggingface.co/colbert-ir/colbertv2.0/raw/main/tokenizer.json"
                    },
                ),
            ],
        ),
    ],
)

Let's create the *Schema* with the fields of our recipe, plus the `embedding_*` and `colbert_*` tensor fields that Vespa computes at indexing time:

In [5]:
app_package.schema.add_fields(
    Field(name="id", type="int", indexing=["attribute", "summary"]),
    Field(
        name="title", type="string", indexing=["index", "summary"], index="enable-bm25"
    ),
    # Field(
    #     name="description", type="string", indexing=["index", "summary"], index="enable-bm25"
    # ),
    # Field(
    #     name="minutes",
    #     type="string",
    #     indexing=["summary"],
    # ),
    # Field(
    #     name="n_steps",
    #     type="string",
    #     indexing=["attribute", "summary"],
    # ),
    # Field(
    #     name="n_ingredients",
    #     type="string",
    #     indexing=["attribute", "summary"],
    # ),
    # Field(
    #     name="submitted",
    #     type="string",
    #     indexing=["attribute", "summary"],
    # ),
    Field(
        name="body",
        type="string", 
        indexing=["index", "summary"],
        index="enable-bm25",
        bolding=True
    ),
    Field(
        name = "body_split",
        type = "array<string>",
        indexing = ["index", "summary"],
        index = "enable-bm25",
        bolding = True,
    ),
    # Field(
    #     name="tags",
    #     type="array<string>",
    #     indexing=["index", "summary"],
    #     index="enable-bm25",
    #     bolding=True,
    # ),
    Field(
        name="steps",
        type="array<string>",
        indexing=["index", "summary"],
        index="enable-bm25",
        bolding=True,
    ),
    Field(
        name="ingredients",
        type="array<string>",
        indexing=["index", "summary"],
        index="enable-bm25",
        bolding=True,
    ),
    #Field(
    #    name="colbert",
    #    type="tensor<int8>(token{},v[16])",
    #    indexing=["attribute", "summary", "index"],
    #    attribute=["distance-metric:hamming"],
    #)
    Field(
        name="embedding_body_split",
        type="tensor<bfloat16>(body_split{}, x[384])",
        indexing=[
            "input body_split",
            "embed e5",
            "attribute",
        ],
        attribute=["distance-metric: angular"],
        is_document_field=False,
    ),
    Field(
        name="colbert_body_split",
        type="tensor<int8>(body_split{}, token{}, v[16])",
        indexing=["input body_split", "embed colbert body_split", "attribute"],
        is_document_field=False,
    ),
    Field(
        name="embedding_steps",
        type="tensor<bfloat16>(steps{}, x[384])",
        indexing=[
            "input steps",
            "embed e5",
            "attribute",
        ],
        attribute=["distance-metric: angular"],
        is_document_field=False,
    ),
    Field(
        name="colbert_steps",
        type="tensor<int8>(steps{}, token{}, v[16])",
        indexing=["input steps", "embed colbert steps", "attribute"],
        is_document_field=False,
    ),
    Field(
        name="embedding_ingredients",
        type="tensor<bfloat16>(ingredients{}, x[384])",
        indexing=[
            "input ingredients",
            "embed e5",
            "attribute",
        ],
        attribute=["distance-metric: angular"],
        is_document_field=False,
    ),
    Field(
        name="colbert_ingredients",
        type="tensor<int8>(ingredients{}, token{}, v[16])",
        indexing=["input ingredients", "embed colbert ingredients", "attribute"],
        is_document_field=False,
    ),
)
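
The `colbert_*` fields use `v[16]` with `int8` cells, while the query tensor in the rank profiles uses `v[128]`: each 128-dimensional token vector is stored as 128 bits packed into 16 bytes. The sketch below illustrates that packing in plain Python. It assumes sign-based binarization (values above zero become 1) and most-significant-bit-first ordering; `pack_bits` and `unpack_bits` are hypothetical helper names that only mimic the behaviour, not Vespa's implementation, and the input vector is made up.

```python
def pack_bits(vector):
    """Binarize a float vector by sign and pack every 8 bits into one int8 value."""
    if len(vector) % 8 != 0:
        raise ValueError("vector length must be a multiple of 8")
    packed = []
    for start in range(0, len(vector), 8):
        byte = 0
        for value in vector[start:start + 8]:
            byte = (byte << 1) | (1 if value > 0 else 0)
        # Map 0..255 to the signed int8 range -128..127.
        packed.append(byte - 256 if byte > 127 else byte)
    return packed


def unpack_bits(packed):
    """Expand each int8 value back into eight 0.0/1.0 floats."""
    bits = []
    for value in packed:
        byte = value & 0xFF  # back to 0..255
        bits.extend(float((byte >> shift) & 1) for shift in range(7, -1, -1))
    return bits


token_vector = [0.3, -0.1, 0.7, -0.2] * 32  # 128 made-up float values
packed = pack_bits(token_vector)
print(len(token_vector), "floats ->", len(packed), "int8 values")
print(packed[:2], unpack_bits(packed[:1]))
```

Packing trades precision for space: only the sign of each dimension survives, which is why the rank expressions unpack the bits before taking dot products with the float query tokens.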
    

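Let's create the *RankProfile*s using [context-level ColBERT](https://blog.vespa.ai/announcing-long-context-colbert-in-vespa/). For each field (`body_split`, `steps`, `ingredients`) there is one profile that keeps the best chunk score (`max`) and one that averages the chunk scores (`avg`), plus variants that add `bm25(title)` in the second phase: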

In [6]:
colbert_max_body_split = RankProfile(
    name="colbert_max_body_split",
    inputs=[
        ("query(q)", "tensor<float>(x[384])"),
        ("query(qt)", "tensor<float>(querytoken{}, v[128])"),
    ],
    functions=[
        Function(name="cos_sim_body_split", expression="closeness(field, embedding_body_split)"),
        Function(
            name="max_sim_per_context_body_split",
            expression="""
                sum(
                    reduce(
                        sum(
                            query(qt) * unpack_bits(attribute(colbert_body_split)), v
                        ),
                        max, token
                    ),
                    querytoken
                )
            """,
        ),
        Function(
            name="max_sim_body_split", expression="reduce(max_sim_per_context_body_split, max, body_split)"
            
        ),
    ],
    first_phase=FirstPhaseRanking(expression="cos_sim_body_split"),
    second_phase=SecondPhaseRanking(expression="max_sim_body_split"),
    match_features=["cos_sim_body_split", "max_sim_body_split", "max_sim_per_context_body_split"],
)

colbert_avg_body_split = RankProfile(
    name="colbert_avg_body_split",
    inputs=[
        ("query(q)", "tensor<float>(x[384])"),
        ("query(qt)", "tensor<float>(querytoken{}, v[128])"),
    ],
    functions=[
        Function(name="cos_sim_body_split", expression="closeness(field, embedding_body_split)"),
        Function(
            name="max_sim_per_context_body_split",
            expression="""
                sum(
                    reduce(
                        sum(
                            query(qt) * unpack_bits(attribute(colbert_body_split)), v
                        ),
                        max, token
                    ),
                    querytoken
                )
            """,
        ),
        Function(
            name="avg_sim_body_split", expression="reduce(max_sim_per_context_body_split, avg, body_split)"
        ),
    ],
    first_phase=FirstPhaseRanking(expression="cos_sim_body_split"),
    second_phase=SecondPhaseRanking(expression="avg_sim_body_split"),
    match_features=["cos_sim_body_split", "avg_sim_body_split", "max_sim_per_context_body_split"],
)

colbert_max_steps= RankProfile(
    name="colbert_max_steps",
    inputs=[
        ("query(q)", "tensor<float>(x[384])"),
        ("query(qt)", "tensor<float>(querytoken{}, v[128])"),
    ],
    functions=[
        Function(name="cos_sim_steps", expression="closeness(field, embedding_steps)"),
        Function(
            name="max_sim_per_context_steps",
            expression="""
                sum(
                    reduce(
                        sum(
                            query(qt) * unpack_bits(attribute(colbert_steps)), v
                        ),
                        max, token
                    ),
                    querytoken
                )
            """,
        ),
        Function(
            name="max_sim_steps", expression="reduce(max_sim_per_context_steps, max, steps)"
            
        ),
    ],
    first_phase=FirstPhaseRanking(expression="cos_sim_steps"),
    second_phase=SecondPhaseRanking(expression="max_sim_steps"),
    match_features=["cos_sim_steps", "max_sim_steps", "max_sim_per_context_steps"],
)

colbert_avg_steps = RankProfile(
    name="colbert_avg_steps",
    inputs=[
        ("query(q)", "tensor<float>(x[384])"),
        ("query(qt)", "tensor<float>(querytoken{}, v[128])"),
    ],
    functions=[
        Function(name="cos_sim_steps", expression="closeness(field, embedding_steps)"),
        Function(
            name="max_sim_per_context_steps",
            expression="""
                sum(
                    reduce(
                        sum(
                            query(qt) * unpack_bits(attribute(colbert_steps)), v
                        ),
                        max, token
                    ),
                    querytoken
                )
            """,
        ),
        Function(
            name="avg_sim_steps", expression="reduce(max_sim_per_context_steps, avg, steps)"
        ),
    ],
    first_phase=FirstPhaseRanking(expression="cos_sim_steps"),
    second_phase=SecondPhaseRanking(expression="avg_sim_steps"),
    match_features=["cos_sim_steps", "avg_sim_steps", "max_sim_per_context_steps"],
)

colbert_max_ingredients = RankProfile(
    name="colbert_max_ingredients",
    inputs = [
        ("query(q)", "tensor<float>(x[384])"),
        ("query(qt)", "tensor<float>(querytoken{}, v[128])"),
    ],
    functions=[
        Function(name="cos_sim_ingredients", expression="closeness(field, embedding_ingredients)"),
        Function(
            name="max_sim_per_context_ingredients",
            expression="""
                sum(
                    reduce(
                        sum(
                            query(qt) * unpack_bits(attribute(colbert_ingredients)), v
                        ),
                        max, token
                    ),
                    querytoken
                )
            """,
        ),
        Function(
            name="max_sim_ingredients", expression="reduce(max_sim_per_context_ingredients, max, ingredients)"
        ),
    ],
    first_phase=FirstPhaseRanking(expression="cos_sim_ingredients"),
    second_phase=SecondPhaseRanking(expression="max_sim_ingredients"),
    match_features=["cos_sim_ingredients", "max_sim_ingredients", "max_sim_per_context_ingredients"],
)

colbert_avg_ingredients = RankProfile(
    name="colbert_avg_ingredients",
    inputs = [
        ("query(q)", "tensor<float>(x[384])"),
        ("query(qt)", "tensor<float>(querytoken{}, v[128])"),
    ],
    functions=[
        Function(name="cos_sim_ingredients", expression="closeness(field, embedding_ingredients)"),
        Function(
            name="max_sim_per_context_ingredients",
            expression="""
                sum(
                    reduce(
                        sum(
                            query(qt) * unpack_bits(attribute(colbert_ingredients)), v
                        ),
                        max, token
                    ),
                    querytoken
                )
            """,
        ),
        Function(
            name="avg_sim_ingredients", expression="reduce(max_sim_per_context_ingredients, avg, ingredients)"
        ),
    ],
    first_phase=FirstPhaseRanking(expression="cos_sim_ingredients"),
    second_phase=SecondPhaseRanking(expression="avg_sim_ingredients"),
    match_features=["cos_sim_ingredients", "avg_sim_ingredients", "max_sim_per_context_ingredients"],
)

colbert_max_body_split_bm25 = RankProfile(
    name="colbert_max_body_split_bm25",
    inherits="colbert_max_body_split",
    first_phase=FirstPhaseRanking(expression="cos_sim_body_split"),
    second_phase=SecondPhaseRanking(expression="max_sim_body_split + bm25(title)"),
)

colbert_avg_body_split_bm25 = RankProfile(
    name="colbert_avg_body_split_bm25",
    inherits="colbert_avg_body_split",
    first_phase=FirstPhaseRanking(expression="cos_sim_body_split"),
    second_phase=SecondPhaseRanking(expression="avg_sim_body_split + bm25(title)"),
)

colbert_max_steps_bm25 = RankProfile(
    name="colbert_max_steps_bm25",
    inherits="colbert_max_steps",
    first_phase=FirstPhaseRanking(expression="cos_sim_steps"),
    second_phase=SecondPhaseRanking(expression="max_sim_steps + bm25(title)"),
)

colbert_avg_steps_bm25 = RankProfile(
    name="colbert_avg_steps_bm25",
    inherits="colbert_avg_steps",
    first_phase=FirstPhaseRanking(expression="cos_sim_steps"),
    second_phase=SecondPhaseRanking(expression="avg_sim_steps + bm25(title)"),
)

colbert_max_ingredients_bm25 = RankProfile(
    name="colbert_max_ingredients_bm25",
    inherits="colbert_max_ingredients",
    first_phase=FirstPhaseRanking(expression="cos_sim_ingredients"),
    second_phase=SecondPhaseRanking(expression="max_sim_ingredients + bm25(title)"),
)

colbert_avg_ingredients_bm25 = RankProfile(
    name="colbert_avg_ingredients_bm25",
    inherits="colbert_avg_ingredients",
    first_phase=FirstPhaseRanking(expression="cos_sim_ingredients"),
    second_phase=SecondPhaseRanking(expression="avg_sim_ingredients + bm25(title)"),
)


app_package.schema.add_rank_profile(colbert_max_body_split)
app_package.schema.add_rank_profile(colbert_avg_body_split)
app_package.schema.add_rank_profile(colbert_max_steps)
app_package.schema.add_rank_profile(colbert_avg_steps)
app_package.schema.add_rank_profile(colbert_max_ingredients)
app_package.schema.add_rank_profile(colbert_avg_ingredients)
app_package.schema.add_rank_profile(colbert_max_body_split_bm25)
app_package.schema.add_rank_profile(colbert_avg_body_split_bm25)
app_package.schema.add_rank_profile(colbert_max_steps_bm25)
app_package.schema.add_rank_profile(colbert_avg_steps_bm25)
app_package.schema.add_rank_profile(colbert_max_ingredients_bm25)
app_package.schema.add_rank_profile(colbert_avg_ingredients_bm25)
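
To make the rank expressions above easier to follow, here is a minimal plain-Python sketch of the same computation: for every chunk, each query token takes its best dot product against the chunk's token vectors, those maxima are summed (MaxSim), and the per-chunk scores are then reduced with `max` or `avg`. The helper names and the toy 4-dimensional vectors are assumptions made for illustration; the document vectors are written as already unpacked 0/1 values.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def max_sim(query_tokens, doc_tokens):
    """Sum over query tokens of the best dot product against any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def score_document(query_tokens, chunks, aggregate):
    """Score every chunk with MaxSim, then reduce the per-chunk scores."""
    per_chunk = {name: max_sim(query_tokens, tokens) for name, tokens in chunks.items()}
    if aggregate == "max":
        return max(per_chunk.values()), per_chunk
    if aggregate == "avg":
        return sum(per_chunk.values()) / len(per_chunk), per_chunk
    raise ValueError(f"unknown aggregate: {aggregate}")


# Toy data: 2 query tokens, 2 chunks, 4-dimensional vectors (the schema uses 128).
query_tokens = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
chunks = {
    "0": [[1.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]],
    "1": [[0.0, 1.0, 0.0, 0.0]],
}

max_score, per_chunk = score_document(query_tokens, chunks, "max")
avg_score, _ = score_document(query_tokens, chunks, "avg")
print(per_chunk, max_score, avg_score)
```

With `max`, a document is as good as its single best chunk; with `avg`, weakly matching chunks pull the score down, which tends to favour documents that match the query throughout.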

In [7]:

#Path("pkg").mkdir(parents=True, exist_ok=True)
#app_package.to_files("pkg")

In [8]:

#! mkdir -p pkg/model
#! curl -L -o pkg/model/tokenizer.json \
#  https://huggingface.co/colbert-ir/colbertv2.0/raw/main/tokenizer.json

#! curl -L -o pkg/model/model.onnx \
#  https://huggingface.co/colbert-ir/colbertv2.0/resolve/main/model.onnx

Let's deploy the Vespa application package with Docker:

In [9]:
from vespa.deployment import VespaDocker

#vespa_docker = VespaDocker()
#app = vespa_docker.deploy_from_disk(application_name="findmypasta", application_root="pkg")

vespa_docker = VespaDocker()
app = vespa_docker.deploy(application_package=app_package)

Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for configuration server, 10/300 seconds...
Waiting for configuration server, 15/300 seconds...
Waiting for configuration server, 20/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 0/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 5/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 10/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 15/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 20/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 25/300 seconds...


## Feeding data

Let's create a function that builds the `body` field:

In [10]:
def recipe_file_body_lines(recipe, complementary_data = None):
    """
    Function responsible for creating the recipe body.
    """
    # Convert the list-like string columns into Python lists
    recipe['tags'] = recipe['tags'].strip("[]").replace("'", "").split(', ')
    recipe['steps'] = recipe['steps'].strip("[]").replace("'", "").split(', ')
    recipe['ingredients'] = recipe['ingredients'].strip("[]").replace("'", "").split(', ')

    # reviews = complementary_data[complementary_data['recipe_id'] == recipe['id']]

    # # ordering by descending date
    # reviews = reviews.sort_values('date', ascending=False)

    # # getting the average rating
    # avg_rating = reviews['rating'].mean()

    # # if the average rating is NaN, we will set it to "No reviews"
    # if np.isnan(avg_rating):
    #     avg_rating = "No reviews"

    # creating the recipe body
    recipe_body = recipe['name'] + '\n' \
    + "Recipe posted on: " + str(recipe['submitted']) + '\n' \
    + "Tags: " + ', '.join(recipe['tags']) + '\n' \
    + "Description: " + recipe['description'] + '\n' \
    + "This recipe takes " + str(recipe['minutes']) + " minutes to be done." + '\n' \
    + "For this recipe you will need the ingredients: " + '\n' \
    + ', '.join(recipe['ingredients']) + '\n' \
    + "The " + str(recipe["n_steps"]) + " steps to make this recipe are: " + '\n' \
    + ', '.join(recipe['steps']) 
    return recipe_body
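
The string-stripping approach used above splits on every `', '`, so an item that itself contains a comma followed by a space is cut into several pieces. This does not change the joined `body` text much, since the pieces are joined back with `', '`, but it matters whenever the list itself is needed. A hedged alternative is `ast.literal_eval`, which the feed-formatting step of this notebook already relies on. The sample string below is made up for illustration.

```python
import ast

# A made-up value in the same list-as-string shape as the CSV columns.
raw_steps = "['preheat oven to 350 degrees', 'mix flour , sugar and butter', 'bake for 30 minutes']"

# Naive parsing: strip brackets and quotes, then split on ', '.
naive = raw_steps.strip("[]").replace("'", "").split(", ")

# Literal parsing: evaluate the string as a Python list literal.
parsed = ast.literal_eval(raw_steps)

print(len(naive), naive)
print(len(parsed), parsed)
```

Here the naive split yields four items because the second step contains `' , '`, while `ast.literal_eval` recovers the three original steps.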

In [11]:
# Helper that applies recipe_file_body_lines to each row of the recipes DataFrame
def apply_recipe_file_body_lines(recipe_row):
    return recipe_file_body_lines(recipe_row)

Now we load the data from the `../input/RAW_recipes.csv` dataset, choose the fields to send, and convert each row to Vespa's feed format:

In [12]:
df = pd.read_csv('../input/RAW_recipes.csv')

# print columns
print(df.columns)
df = df.dropna()
df = df.reset_index(drop=True)

df['body'] = df.apply(apply_recipe_file_body_lines, axis=1)
df['body_split'] = df['body'].str.split('\n')

df['minutes'] = "This recipe takes " + df['minutes'].astype(str) + " minutes to be done."
df['submitted'] = 'Recipe submitted on: ' + df["submitted"]
df['tags'] = df["tags"]
df['n_steps'] = 'Number of steps to make this recipe: ' + df['n_steps'].astype(str)
df['n_ingredients'] = 'Number of ingredients: ' + df['n_ingredients'].astype(str)
df['steps'] = df["steps"]
df['description'] = df["description"]
df['ingredients'] = df["ingredients"]
df['title'] = df['name']

namespace = "recipes"
document_type = "findmypasta"

def to_vespa_format(x):
    document_id = f"id:{namespace}:{document_type}::{x['id']}"
    return {
        "put": document_id,
        "fields": {
            "id": x["id"],
            "title": x["name"],
            #"tags": ast.literal_eval(x["tags"]),
            "steps": ast.literal_eval(x["steps"]),
            #"description": x["description"],
            "ingredients": ast.literal_eval(x["ingredients"]),
            #"minutes": x["minutes"],
            #"n_steps": x["n_steps"],
            #"n_ingredients": x["n_ingredients"],
            #"submitted": x["submitted"],
            #"body": x["body"],
            "body_split": x["body_split"]
        }
    }

vespa_feed = df.apply(to_vespa_format, axis=1).tolist()
vespa_feed_slice = vespa_feed[0:100]
print(vespa_feed_slice[0])

Index(['name', 'id', 'minutes', 'contributor_id', 'submitted', 'tags',
       'nutrition', 'n_steps', 'steps', 'description', 'ingredients',
       'n_ingredients'],
      dtype='object')
{'put': 'id:recipes:findmypasta::137739', 'fields': {'id': 137739, 'title': 'arriba   baked winter squash mexican style', 'steps': ['make a choice and proceed with recipe', 'depending on size of squash , cut into half or fourths', 'remove seeds', 'for spicy squash , drizzle olive oil or melted butter over each cut squash piece', 'season with mexican seasoning mix ii', 'for sweet squash , drizzle melted honey , butter , grated piloncillo over each cut squash piece', 'season with sweet mexican spice mix', 'bake at 350 degrees , again depending on size , for 40 minutes up to an hour , until a fork can easily pierce the skin', 'be careful not to burn the squash especially if you opt to use sugar or butter', 'if you feel more comfortable , cover the squash with aluminum foil the first half hour , give or t

Write the formatted documents to a JSON Lines file:

In [13]:
with open("vespa_feed2.jsonl", "w") as f:
    for item in vespa_feed_slice:
        f.write(json.dumps(item) + "\n")
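
As a small self-contained illustration of the feed format, the sketch below builds `put` operations whose IDs follow the `id:<namespace>:<document-type>::<user-specified-id>` pattern used in `to_vespa_format`, and writes them as one JSON object per line. The `to_put_operation` helper and the two toy recipes are hypothetical; an in-memory buffer stands in for the `.jsonl` file.

```python
import io
import json


def to_put_operation(namespace, document_type, user_id, fields):
    """Build one feed operation; the ID follows id:<namespace>:<document-type>::<user-id>."""
    return {"put": f"id:{namespace}:{document_type}::{user_id}", "fields": fields}


# Two made-up recipes, only to show the shape of the feed file.
recipes = [
    {"id": 1, "title": "toy recipe one", "body_split": ["toy recipe one", "Tags: easy"]},
    {"id": 2, "title": "toy recipe two", "body_split": ["toy recipe two", "Tags: quick"]},
]

buffer = io.StringIO()  # stands in for the .jsonl file
for recipe in recipes:
    operation = to_put_operation("recipes", "findmypasta", recipe["id"], recipe)
    buffer.write(json.dumps(operation) + "\n")

lines = buffer.getvalue().splitlines()
first = json.loads(lines[0])
print(first["put"])
```

Keeping one operation per line means the file can be checked or streamed line by line without loading the whole feed into memory.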

Feed the documents to Vespa:

In [14]:
! vespa config set target local
! vespa feed vespa_feed2.jsonl

{
  "feeder.operation.count": 100,
  "feeder.seconds": 134.560,
  "feeder.ok.count": 100,
  "feeder.ok.rate": 0.743,
  "feeder.error.count": 0,
  "feeder.inflight.count": 0,
  "http.request.count": 100,
  "http.request.bytes": 81098,
  "http.request.MBps": 0.001,
  "http.exception.count": 0,
  "http.response.count": 100,
  "http.response.bytes": 9422,
  "http.response.MBps": 0.000,
  "http.response.error.count": 0,
  "http.response.latency.millis.min": 12402,
  "http.response.latency.millis.avg": 21720,
  "http.response.latency.millis.max": 44415,
  "http.response.code.counts": {
    "200": 100
  }
}


In [15]:
documents = app.query(yql = "select * from sources * where true")
documents.number_documents_indexed

100

## Queries

Let's load the queries to run from the `../input/Recipe_Search_Questions.xlsx` dataset:

In [16]:
# load Recipe_Search_Questions.xlsx; each row holds one query to run
import pandas as pd
questions = pd.read_excel('../input/Recipe_Search_Questions.xlsx')
questions.head()

Unnamed: 0,Tipo,Descrição,Query
0,Keywords,Pergunta simples,grilled cheese sandwich recipe
1,Keywords,Pergunta simples,mango smoothie
2,Semantica,Pergunta média,gluten-free bread without yeast
3,Semantica,Pergunta média,low carb dessert for diabetics
4,Semantica,Pergunta difícil,traditional Japanese breakfast for a family


We generate one output file per rank profile with the top results for each *query*:

In [17]:
from vespa.io import VespaQueryResponse
import json

# Assumes 'questions' is a DataFrame with the columns ['Query', 'Tipo', 'Descrição']
data = pd.DataFrame(columns=['id', 'title', 'Query', 'Tipo', 'Descrição'])

model_to_ranking_dict = {
    "colbert_max_body_split": "colbert_max_body_split",
    "colbert_avg_body_split": "colbert_avg_body_split",
    "colbert_max_steps": "colbert_max_steps",
    "colbert_avg_steps": "colbert_avg_steps",
    "colbert_max_ingredients": "colbert_max_ingredients",
    "colbert_avg_ingredients": "colbert_avg_ingredients",
    "colbert_max_body_split_bm25": "colbert_max_body_split_bm25",
    "colbert_avg_body_split_bm25": "colbert_avg_body_split_bm25",
    "colbert_max_steps_bm25": "colbert_max_steps_bm25",
    "colbert_avg_steps_bm25": "colbert_avg_steps_bm25",
    "colbert_max_ingredients_bm25": "colbert_max_ingredients_bm25",
    "colbert_avg_ingredients_bm25": "colbert_avg_ingredients_bm25",
}

embeddings = {
    "colbert_max_body_split": "embedding_body_split",
    "colbert_avg_body_split": "embedding_body_split",
    "colbert_max_steps": "embedding_steps",
    "colbert_avg_steps": "embedding_steps",
    "colbert_max_ingredients": "embedding_ingredients",
    "colbert_avg_ingredients": "embedding_ingredients",
    "colbert_max_body_split_bm25": "embedding_body_split",
    "colbert_avg_body_split_bm25": "embedding_body_split",
    "colbert_max_steps_bm25": "embedding_steps",
    "colbert_avg_steps_bm25": "embedding_steps",
    "colbert_max_ingredients_bm25": "embedding_ingredients",
    "colbert_avg_ingredients_bm25": "embedding_ingredients"
}

for selected_model in model_to_ranking_dict.keys():
    output_name = 'output/Results_' + selected_model + '_extraQuestions' + '.xlsx'
    embedding = embeddings[selected_model]
    # start from an empty table so each file holds only this profile's results
    data = pd.DataFrame(columns=['id', 'title', 'Query', 'Tipo', 'Descrição'])

    if model_to_ranking_dict[selected_model] is not None:
        i = 0
        for input_query in questions['Query']:
            # save a checkpoint every 100 queries
            if i % 100 == 0:
                data.to_excel(output_name, index=False)

            with app.syncio(connections=1) as session:
                try:
                    response: VespaQueryResponse = session.query(
                        yql="select * from sources * where ({targetHits:1000}nearestNeighbor(" + embedding + ",q)) limit 5",
                        query=input_query,
                        ranking=model_to_ranking_dict[selected_model],
                        body={
                            "input.query(q)": f"embed(e5, \"{input_query}\")",
                            "input.query(qt)": f"embed(colbert, \"{input_query}\")",
                            # "input.query(q)": f"embed({input_query})",
                            #"timeout": "30s"  # raise the timeout to 30 seconds
                        }
                    )
                    assert response.is_successful()
                except Exception as e:
                    print(f"Error with query '{input_query}': {e}")
                    continue

                for hit in response.hits:
                    record = {}
                    for field in ['id', 'title']:
                        record[field] = hit['fields'].get(field, None)
                    record["Query"] = input_query
                    record["Tipo"] = questions[questions['Query'] == input_query]['Tipo'].values[0]
                    record["Descrição"] = questions[questions['Query'] == input_query]['Descrição'].values[0]
                    data = pd.concat([data, pd.DataFrame([record])], ignore_index=True)

            i += 1

        # Sorting
        data = data.sort_values(by=['Tipo', 'Query'])

        # reordering columns
        data = data[['Tipo', 'Descrição', 'Query', 'id', 'title']]

        # exporting to excel
        data.to_excel(output_name, index=False)


Results for the query `chocolate`:

In [18]:
from vespa.io import VespaQueryResponse

with app.syncio(connections=1) as session:
    response:VespaQueryResponse = session.query(
        yql="select * from sources * where ({targetHits:1000}nearestNeighbor(embedding_body_split,q))",
        ranking="colbert_max_body_split",
        query="chocolate", 
        body={
            "input.query(q)": f'embed(e5, "chocolate")',
            "input.query(qt)": f'embed(colbert, "chocolate")',
        },
    )

assert(response.is_successful())
for hit in response.hits:
    record = {}
    for field in ['id', 'title', 'body_split']:
        record[field] = hit['fields'][field]
    print(record)

{'id': 58651, 'title': 'turtle  squares', 'body_split': ['turtle  squares', 'Recipe posted on: 2003-04-07', 'Tags: 30-minutes-or-less, time-to-make, course, main-ingredient, cuisine, preparation, occasion, north-american, for-large-groups, desserts, fruit, oven, easy, finger-food, kid-friendly, cookies-and-brownies, chocolate, bar-cookies, nuts, dietary, low-sodium, low-in-something, taste-mood, sweet, equipment, number-of-servings, presentation', 'Description: for lovers of pecans and chocolate...', 'This recipe takes 30 minutes to be done.', 'For this recipe you will need the ingredients: ', 'flour, brown sugar, butter, pecans, semi-sweet chocolate chips', 'The 15 steps to make this recipe are: ', 'preheat oven to 350 degrees f, spray a 13 x 9 baking pan evenly with non-stick cooking spray, beat 1 cup brown sugar with 1 / 2 cup melted butter with an electric mixer on medium for 2-3 minutes, add the flour mixture and mix until smooth, press the flour mixture evenly and firmly into the

We can look at the total and mean relevance of each chunk (each line of `body_split`). The chunk indexes map to the lines built in `recipe_file_body_lines`:

- `0`: `recipe['name']`;
- `1`: `"Recipe posted on: " + str(recipe['submitted'])`;
- `2`: `"Tags: " + ', '.join(recipe['tags'])`;
- `3`: `"Description: " + recipe['description']`;
- `4`: `"This recipe takes " + str(recipe['minutes']) + " minutes to be done."`;
- `5`: `"For this recipe you will need the ingredients: "`;
- `6`: `', '.join(recipe['ingredients'])`;
- `7`: `"The " + str(recipe["n_steps"]) + " steps to make this recipe are: "`;
- `8`: `', '.join(recipe['steps'])`.

In [19]:
# inspect the per-chunk cells of the first hit
#print(response.hits[0]['fields']['matchfeatures']['max_sim_per_context_body_split']['cells'])

total = {'0': 0, '1': 0, '2': 0, '3': 0, '4': 0, '5': 0, '6': 0, '7': 0, '8': 0}

for hit in response.hits:
    cells = hit['fields']['matchfeatures']['max_sim_per_context_body_split']['cells']
    for key in total.keys():
        total[key] += cells[key]

# mean score per chunk over all returned hits
mean = {k: v / len(response.hits) for k, v in total.items()}

print(total)
print(mean)

{'0': 420.0016670227051, '1': 274.50904846191406, '2': 308.63941764831543, '3': 416.2218737602234, '4': 181.57076835632324, '5': 84.61458206176758, '6': 570.7690467834473, '7': 135.15281772613525, '8': 338.6511507034302}
{'0': 42.000166702270505, '1': 27.450904846191406, '2': 30.863941764831544, '3': 41.62218737602234, '4': 18.157076835632324, '5': 8.461458206176758, '6': 57.076904678344725, '7': 13.515281772613525, '8': 33.865115070343016}
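
To read these numbers more easily, the chunks can be sorted by their mean score. The sketch below does that with the mean values from the output above, rounded to two decimals; the short labels are shorthand for the chunk descriptions listed earlier and are not part of the Vespa response.

```python
# Mean score per chunk, rounded from the output above.
mean_scores = {"0": 42.00, "1": 27.45, "2": 30.86, "3": 41.62, "4": 18.16,
               "5": 8.46, "6": 57.08, "7": 13.52, "8": 33.87}

# Short labels for the chunk indexes (shorthand, chosen for this sketch).
labels = {"0": "name", "1": "posted on", "2": "tags", "3": "description", "4": "minutes",
          "5": "ingredients header", "6": "ingredients", "7": "steps header", "8": "steps"}

# Highest mean score first.
ranked = sorted(mean_scores.items(), key=lambda item: item[1], reverse=True)
for index, score in ranked:
    print(f"{index} {labels[index]:<20} {score:6.2f}")
```

For the query `chocolate` in this run, the ingredients line (chunk `6`) has the highest mean score, followed by the recipe name and the description, while the fixed header lines (chunks `5` and `7`) score lowest.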
