# Long-Context ColBERT

Let's build a Vespa application that uses Long-Context ColBERT, following [this example](https://pyvespa.readthedocs.io/en/latest/examples/chat_with_your_pdfs_using_colbert_langchain_and_Vespa-cloud.html).

## Imports

First, we import the libraries needed to build Vespa application packages.

In [4]:
from vespa.package import (
    ApplicationPackage,
    Component,
    Parameter,
    Field,
    HNSW,
    RankProfile,
    Function,
    FirstPhaseRanking,
    SecondPhaseRanking,
    FieldSet,
    DocumentSummary,
    Summary,
)
from pathlib import Path
import json
import pandas as pd
import ast
import numpy as np

from vespa.package import Schema, Document, Field, FieldSet

Let's check that the Vespa CLI is installed:

In [5]:
!vespa --version

Usage:
  vespa [flags]
  vespa [command]

Available Commands:
  activate    Activate (deploy) a previously prepared application package
  auth        Manage Vespa Cloud credentials
  clone       Create files and directory structure from a Vespa sample application
  completion  Generate the autocompletion script for the specified shell
  config      Manage persistent values for global flags
  curl        Access Vespa directly using curl
  deploy      Deploy (prepare and activate) an application package
  destroy     Remove a deployed Vespa application and its data
  document    Issue a single document operation to Vespa
  feed        Feed multiple document operations to Vespa
  fetch       Download a deployed application package
  help        Help about any command
  log         Show the Vespa log
  prepare     Prepare an application package for activation
  prod        Deploy an application package to production in Vespa Cloud
  query       Issue a query to Vespa
  status      Show Ves

## Creating the Vespa application

Let's create the Vespa application package with two embedder components, `e5` and `colbert`:

In [6]:
from vespa.package import ApplicationPackage, Component, Parameter

vespa_app_name = "findmypasta"
app_package = ApplicationPackage(
    name=vespa_app_name,
    components=[
        Component(
            id="e5",
            type="hugging-face-embedder",
            parameters=[
                Parameter(
                    name="transformer-model",
                    args={
                        "url": "https://huggingface.co/intfloat/e5-small-v2/resolve/main/model.onnx"
                    },
                ),
                Parameter(
                    name="tokenizer-model",
                    args={
                        "url": "https://huggingface.co/intfloat/e5-small-v2/raw/main/tokenizer.json"
                    },
                ),
            ],
        ),
        Component(
            id="colbert",
            type="colbert-embedder",
            parameters=[
                Parameter(
                    name="transformer-model",
                    args={
                        "url": "https://huggingface.co/colbert-ir/colbertv2.0/resolve/main/model.onnx"
                    },
                ),
                Parameter(
                    name="tokenizer-model",
                    args={
                        "url": "https://huggingface.co/colbert-ir/colbertv2.0/raw/main/tokenizer.json"
                    },
                ),
            ],
        ),
    ],
)

Next, we add the recipe fields to the *Schema*, along with the `embedding` and `colbert` tensor fields:

In [7]:
app_package.schema.add_fields(Field(name="id", type="int", indexing=["attribute", "summary"]),
    Field(
        name="title", type="string", indexing=["index", "summary"], index="enable-bm25"
    ),
    Field(
        name="description", type="string", indexing=["index", "summary"], index="enable-bm25"
    ),
    Field(
        name="minutes",
        type="string",
        indexing=["summary"],
    ),
    Field(
        name="n_steps",
        type="string",
        indexing=["attribute", "summary"],
    ),
    Field(
        name="n_ingredients",
        type="string",
        indexing=["attribute", "summary"],
    ),
    Field(
        name="submitted",
        type="string",
        indexing=["attribute", "summary"],
    ),
    Field(
        name="body",
        type="string", 
        indexing=["index", "summary"],
        index="enable-bm25",
        bolding=True
    ),
    Field(
        name = "body_split",
        type = "array<string>",
        indexing = ["index", "summary"],
        index = "enable-bm25",
        bolding = True,
    ),
    Field(
        name="tags",
        type="array<string>",
        indexing=["index", "summary"],
        index="enable-bm25",
        bolding=True,
    ),
    Field(
        name="steps",
        type="array<string>",
        indexing=["index", "summary"],
        index="enable-bm25",
        bolding=True,
    ),
    Field(
        name="ingredients",
        type="array<string>",
        indexing=["index", "summary"],
        index="enable-bm25",
        bolding=True,
    ),
    #Field(
    #    name="colbert",
    #    type="tensor<int8>(token{},v[16])",
    #    indexing=["attribute", "summary", "index"],
    #    attribute=["distance-metric:hamming"],
    #)
    Field(
        name="embedding",
        type="tensor<bfloat16>(body_split{}, x[384])",
        indexing=[
            "input body_split",
            'for_each { (input title || "") . " " . ( _ || "") }',
            "embed e5",
            "attribute",
        ],
        attribute=["distance-metric: angular"],
        is_document_field=False,
    ),
    Field(
        name="colbert",
        type="tensor<int8>(body_split{}, token{}, v[16])",
        indexing=["input body_split", "embed colbert body_split", "attribute"],
        is_document_field=False,
    )
)
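
The `for_each` indexing expression on the `embedding` field prepends the recipe title to every element of `body_split` before it is passed to the `e5` embedder. The plain-Python sketch below only mimics that string construction so the embedded inputs are easier to picture; `texts_to_embed` and the sample strings are hypothetical and are not part of Vespa or pyvespa.

```python
def texts_to_embed(title, body_split):
    """Mimic `for_each { (input title || "") . " " . ( _ || "") }` in plain Python."""
    title = title or ""  # `input title || ""` falls back to an empty string
    # Each chunk becomes "<title> <chunk>"; a missing chunk becomes an empty string
    return [f"{title} {chunk or ''}" for chunk in body_split]


# Hypothetical sample data, for illustration only
chunks = ["Tags: easy, vegetarian", "Description: a quick squash side dish"]
for text in texts_to_embed("baked winter squash", chunks):
    print(text)
```

Each resulting string yields one vector, stored under the corresponding `body_split{}` label of the `embedding` tensor.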
    

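Now we create the *RankProfile* for [context-level ColBERT](https://blog.vespa.ai/announcing-long-context-colbert-in-vespa/). The first phase ranks by the closeness of the `e5` embedding, and the second phase re-ranks by the best per-context ColBERT MaxSim score: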

In [8]:
colbert = RankProfile(
    name="colbert_context_level",
    inputs=[
        ("query(q)", "tensor<float>(x[384])"),
        ("query(qt)", "tensor<float>(querytoken{}, v[128])"),
    ],
    functions=[
        Function(name="cos_sim", expression="closeness(field, embedding)"),
        Function(
            name="max_sim_per_context",
            expression="""
                sum(
                    reduce(
                        sum(
                            query(qt) * unpack_bits(attribute(colbert)) , v
                        ),
                        max, token
                    ),
                    querytoken
                )
            """,
        ),
        Function(
            name="max_sim", expression="reduce(max_sim_per_context, max, body_split)"
            
        ),
    ],
    first_phase=FirstPhaseRanking(expression="cos_sim"),
    second_phase=SecondPhaseRanking(expression="max_sim"),
    match_features=["cos_sim", "max_sim", "max_sim_per_context"],
)
app_package.schema.add_rank_profile(colbert)
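
To make the `max_sim_per_context` and `max_sim` expressions more concrete, here is a minimal pure-Python sketch of the same computation on toy data. It assumes that `unpack_bits` expands each `int8` value into 8 values of 0 or 1, most significant bit first; the helper names and the toy vectors (16 dimensions instead of 128) are illustrative only and this is not how Vespa evaluates the expression internally.

```python
def unpack_bits(packed):
    """Expand signed 8-bit ints into a flat list of 0/1 floats (assumed MSB first)."""
    bits = []
    for value in packed:
        byte = value & 0xFF  # reinterpret the int8 value as an unsigned byte
        bits.extend(float((byte >> shift) & 1) for shift in range(7, -1, -1))
    return bits


def dot(a, b):
    """Dot product over the `v` dimension."""
    return sum(x * y for x, y in zip(a, b))


def max_sim_per_context(query_tokens, contexts):
    """Per context: sum over query tokens of the best-matching document token score."""
    scores = {}
    for context_id, packed_tokens in contexts.items():
        doc_tokens = [unpack_bits(token) for token in packed_tokens]
        scores[context_id] = sum(
            max(dot(q, d) for d in doc_tokens) for q in query_tokens
        )
    return scores


# Toy data: 2 query tokens of dimension 16; document tokens packed into 2 int8 values
query_tokens = [[1.0] * 16, [0.5] * 16]
contexts = {
    "0": [[0, 0], [-1, 0]],  # two tokens; the second has 8 bits set
    "1": [[-1, -1]],         # one token with all 16 bits set
}
per_context = max_sim_per_context(query_tokens, contexts)
max_sim = max(per_context.values())  # reduce(max_sim_per_context, max, body_split)
print(per_context, max_sim)
```

The final `max` mirrors the `max_sim` function: a recipe is scored by its single best-matching line of `body_split`, not by the sum over all lines.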

In [9]:

#Path("pkg").mkdir(parents=True, exist_ok=True)
#app_package.to_files("pkg")

In [10]:

#! mkdir -p pkg/model
#! curl -L -o pkg/model/tokenizer.json \
#  https://huggingface.co/colbert-ir/colbertv2.0/raw/main/tokenizer.json

#! curl -L -o pkg/model/model.onnx \
#  https://huggingface.co/colbert-ir/colbertv2.0/resolve/main/model.onnx

Now we deploy the application package to a local Vespa Docker container:

In [11]:
from vespa.deployment import VespaDocker

#vespa_docker = VespaDocker()
#app = vespa_docker.deploy_from_disk(application_name="findmypasta", application_root="pkg")

vespa_docker = VespaDocker()
app = vespa_docker.deploy(application_package=app_package)

Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for configuration server, 10/300 seconds...
Waiting for configuration server, 15/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 0/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 5/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 10/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 15/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 20/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 25/300 seconds...
Using plain http against endpoint http://localhost:8

## Feeding data

First, we define a function that builds the `body` field:

In [12]:
def recipe_file_body_lines(recipe, complementary_data = None):
    """
    Function responsible for creating the recipe body.
    """
    # Convert the list-like string columns into lists
    recipe['tags'] = recipe['tags'].strip("[]").replace("'", "").split(', ')
    recipe['steps'] = recipe['steps'].strip("[]").replace("'", "").split(', ')
    recipe['ingredients'] = recipe['ingredients'].strip("[]").replace("'", "").split(', ')

    # reviews = complementary_data[complementary_data['recipe_id'] == recipe['id']]

    # # ordering by descending date
    # reviews = reviews.sort_values('date', ascending=False)

    # # getting the average rating
    # avg_rating = reviews['rating'].mean()

    # # if the average rating is NaN, we will set it to "No reviews"
    # if np.isnan(avg_rating):
    #     avg_rating = "No reviews"

    # creating the recipe body
    recipe_body = recipe['name'] + '\n' \
    + "Recipe posted on: " + str(recipe['submitted']) + '\n' \
    + "Tags: " + ', '.join(recipe['tags']) + '\n' \
    + "Description: " + recipe['description'] + '\n' \
    + "This recipe takes " + str(recipe['minutes']) + " minutes to be done." + '\n' \
    + "For this recipe you will need the ingredients: " + '\n' \
    + ', '.join(recipe['ingredients']) + '\n' \
    + "The " + str(recipe["n_steps"]) + " steps to make this recipe are: " + '\n' \
    + ', '.join(recipe['steps']) 
    return recipe_body

In [13]:
# Helper that applies recipe_file_body_lines to each row of the recipes DataFrame
def apply_recipe_file_body_lines(recipe_row):
    return recipe_file_body_lines(recipe_row)
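
One detail worth knowing about the string-to-list conversion above: splitting on `', '` also splits inside items that contain a comma, and `replace("'", "")` removes apostrophes from the text. Because the pieces are joined back with `', '`, the `body` text is largely unaffected, but the intermediate lists do not match the original items. The sketch below compares this with `ast.literal_eval` on a made-up `raw_steps` string (hypothetical sample data in the same style as the dataset).

```python
import ast

# Hypothetical sample in the same string format as the `steps` column
raw_steps = (
    "['preheat oven to 350 degrees', "
    "'depending on size of squash , cut into half or fourths', "
    "'remove seeds']"
)

# Approach used in recipe_file_body_lines: strip brackets and quotes, split on ', '
naive = raw_steps.strip("[]").replace("'", "").split(', ')

# Alternative: parse the string as a Python literal
parsed = ast.literal_eval(raw_steps)

print(len(naive), naive)
print(len(parsed), parsed)
```

Here the naive split yields four pieces because the second step contains `" , "`, while `ast.literal_eval` returns the three original steps.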

Next, we load the dataset `../input/RAW_recipes.csv`, select the fields to send, and convert each row to the Vespa document format:

In [14]:
df = pd.read_csv('../input/RAW_recipes.csv')

# print columns
print(df.columns)
df = df.dropna()
df = df.reset_index(drop=True)

df['body'] = df.apply(apply_recipe_file_body_lines, axis=1)
df['body_split'] = df['body'].str.split('\n')

df['minutes'] = "This recipe takes " + df['minutes'].astype(str) + " minutes to be done."
df['submitted'] = 'Recipe submitted on: ' + df["submitted"]
df['tags'] = df["tags"]
df['n_steps'] = 'Number of steps to make this recipe: ' + df['n_steps'].astype(str)
df['n_ingredients'] = 'Number of ingredients: ' + df['n_ingredients'].astype(str)
df['steps'] = df["steps"]
df['description'] = df["description"]
df['ingredients'] = df["ingredients"]
df['title'] = df['name']

namespace = "recipes"
document_type = "findmypasta"

def to_vespa_format(x):
    document_id = f"id:{namespace}:{document_type}::{x['id']}"
    return {
        "put": document_id,
        "fields": {
            "id": x["id"],
            "title": x["name"],
            "tags": ast.literal_eval(x["tags"]),
            "steps": ast.literal_eval(x["steps"]),
            "description": x["description"],
            "ingredients": ast.literal_eval(x["ingredients"]),
            "minutes": x["minutes"],
            "n_steps": x["n_steps"],
            "n_ingredients": x["n_ingredients"],
            "submitted": x["submitted"],
            "body": x["body"],
            "body_split": x["body_split"]
        }
    }

vespa_feed = df.apply(to_vespa_format, axis=1).tolist()
vespa_feed_slice = vespa_feed[0:1000]
print(vespa_feed_slice[0])

Index(['name', 'id', 'minutes', 'contributor_id', 'submitted', 'tags',
       'nutrition', 'n_steps', 'steps', 'description', 'ingredients',
       'n_ingredients'],
      dtype='object')
{'put': 'id:recipes:findmypasta::137739', 'fields': {'id': 137739, 'title': 'arriba   baked winter squash mexican style', 'tags': ['60-minutes-or-less', 'time-to-make', 'course', 'main-ingredient', 'cuisine', 'preparation', 'occasion', 'north-american', 'side-dishes', 'vegetables', 'mexican', 'easy', 'fall', 'holiday-event', 'vegetarian', 'winter', 'dietary', 'christmas', 'seasonal', 'squash'], 'steps': ['make a choice and proceed with recipe', 'depending on size of squash , cut into half or fourths', 'remove seeds', 'for spicy squash , drizzle olive oil or melted butter over each cut squash piece', 'season with mexican seasoning mix ii', 'for sweet squash , drizzle melted honey , butter , grated piloncillo over each cut squash piece', 'season with sweet mexican spice mix', 'bake at 350 degrees , agai

Write the formatted documents to a JSONL file:

In [15]:
with open("vespa_feed2.jsonl", "w") as f:
    for item in vespa_feed_slice:
        f.write(json.dumps(item) + "\n")
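
Before feeding, it can help to confirm that every line of the JSONL file is a valid put operation. The following is a minimal sketch of such a check; `validate_feed`, the list of required fields, and the sample document are assumptions made for illustration and are not part of the Vespa CLI or pyvespa.

```python
import json
import os
import tempfile


def validate_feed(path, required_fields=("id", "title", "body", "body_split")):
    """Return the number of operations in a JSONL feed, raising on malformed lines."""
    count = 0
    with open(path) as f:
        for line_number, line in enumerate(f, start=1):
            operation = json.loads(line)  # raises if the line is not valid JSON
            fields = operation.get("fields", {})
            missing = [name for name in required_fields if name not in fields]
            if "put" not in operation or missing:
                raise ValueError(
                    f"line {line_number}: missing put id or fields {missing}"
                )
            count += 1
    return count


# Toy feed with a single hypothetical document, written to a temporary file
sample = {
    "put": "id:recipes:findmypasta::1",
    "fields": {
        "id": 1,
        "title": "toast",
        "body": "toast\nTags: easy",
        "body_split": ["toast", "Tags: easy"],
    },
}
with tempfile.TemporaryDirectory() as tmp:
    feed_path = os.path.join(tmp, "sample_feed.jsonl")
    with open(feed_path, "w") as f:
        f.write(json.dumps(sample) + "\n")
    n_operations = validate_feed(feed_path)

print(n_operations)
```

Run against `vespa_feed2.jsonl`, the same function should return the number of documents in the slice.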

Feed the documents to Vespa:

In [16]:
! vespa config set target local
! vespa feed vespa_feed2.jsonl

{
  "feeder.operation.count": 1000,
  "feeder.seconds": 750.249,
  "feeder.ok.count": 1000,
  "feeder.ok.rate": 1.333,
  "feeder.error.count": 0,
  "feeder.inflight.count": 0,
  "http.request.count": 1000,
  "http.request.bytes": 1032492,
  "http.request.MBps": 0.001,
  "http.exception.count": 0,
  "http.response.count": 1000,
  "http.response.bytes": 95492,
  "http.response.MBps": 0.000,
  "http.response.error.count": 0,
  "http.response.latency.millis.min": 4615,
  "http.response.latency.millis.avg": 14070,
  "http.response.latency.millis.max": 38188,
  "http.response.code.counts": {
    "200": 1000
  }
}
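
The feeder report can be turned into a few easier-to-read figures. The short calculation below uses only numbers copied from the report above; the low throughput is presumably due to both embedders running on every line of `body_split` at feed time, although the report itself does not say so.

```python
# Values copied from the feeder report above
report = {
    "feeder.operation.count": 1000,
    "feeder.seconds": 750.249,
    "http.request.bytes": 1032492,
}

docs_per_second = report["feeder.operation.count"] / report["feeder.seconds"]
seconds_per_doc = report["feeder.seconds"] / report["feeder.operation.count"]
avg_request_bytes = report["http.request.bytes"] / report["feeder.operation.count"]

print(f"{docs_per_second:.3f} documents per second")
print(f"{seconds_per_doc:.3f} seconds per document")
print(f"{avg_request_bytes:.0f} bytes per request on average")
```

The first figure reproduces the `feeder.ok.rate` of 1.333 reported by the CLI.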


In [17]:
documents = app.query(yql = "select * from sources * where true")
documents.number_documents_indexed

1000

## Queries

Let's load the queries to run, stored in `../input/Recipe_Search_Questions.xlsx`:

In [18]:
# Load the questions; each row holds one query to run
import pandas as pd
questions = pd.read_excel('../input/Recipe_Search_Questions.xlsx')
questions.head()

| Unnamed: 0 | Tipo | Descrição | Query |
|---|---|---|---|
| 0 | Keywords | Pergunta simples | grilled cheese sandwich recipe |
| 1 | Keywords | Pergunta simples | mango smoothie |
| 2 | Semantica | Pergunta média | gluten-free bread without yeast |
| 3 | Semantica | Pergunta média | low carb dessert for diabetics |
| 4 | Semantica | Pergunta difícil | traditional Japanese breakfast for a family |


We then run each *query* and write the results to an output file:

In [19]:
from vespa.io import VespaQueryResponse
import json

# Assumes 'questions' is a DataFrame with columns ['Query', 'Tipo', 'Descrição']
data = pd.DataFrame(columns=['id', 'title', 'Query', 'Tipo', 'Descrição'])

model_to_ranking_dict = {
    "colbert": "colbert",
}

selected_model = "colbert"

assert selected_model in model_to_ranking_dict.keys()

output_name = 'output/Results_' + selected_model + '_extraQuestions' + '.xlsx'

if model_to_ranking_dict[selected_model] is not None:
    i = 0
    for input_query in questions['Query']:
        # save a checkpoint each 100 queries
        if i % 100 == 0:
            data.to_excel(output_name, index=False)

        with app.syncio(connections=1) as session:
            try:
                response: VespaQueryResponse = session.query(
                    yql="select * from sources * where ({targetHits:1000}nearestNeighbor(embedding,q)) limit 5",
                    query=input_query,
                    ranking="colbert_context_level",
                    body={
                        "input.query(q)": f"embed(e5, \"{input_query}\")",
                        "input.query(qt)": f"embed(colbert, \"{input_query}\")",
                        # "input.query(q)": f"embed({input_query})",
                        #"timeout": "30s"  # Increase the timeout to 30 seconds
                    }
                )
                assert response.is_successful()
            except Exception as e:
                print(f"Error with query '{input_query}': {e}")
                continue

            for hit in response.hits:
                record = {}
                for field in ['id', 'title']:
                    record[field] = hit['fields'].get(field, None)
                record["Query"] = input_query
                record["Tipo"] = questions[questions['Query'] == input_query]['Tipo'].values[0]
                record["Descrição"] = questions[questions['Query'] == input_query]['Descrição'].values[0]
                data = pd.concat([data, pd.DataFrame([record])], ignore_index=True)

        i += 1

    # Sorting
    data = data.sort_values(by=['Tipo', 'Query'])

    # reordering columns
    data = data[['Tipo', 'Descrição', 'Query', 'id', 'title']]

    # exporting to excel
    data.to_excel(output_name, index=False)
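
The loop above interpolates the query text directly into the `embed(...)` strings, which can break if a query contains double quotes. One possible alternative, sketched below, is to let `embed` reference the `query` request parameter with the `@query` syntax. This relies on Vespa's parameter-reference syntax for `embed`, which should be verified against the Vespa embedding documentation for the deployed version; `build_query_body` is a hypothetical helper, not part of pyvespa.

```python
def build_query_body(query_text, hits=5, target_hits=1000):
    """Build a request body whose embed() calls reference the `query` parameter."""
    yql = (
        "select * from sources * where "
        f"({{targetHits:{target_hits}}}nearestNeighbor(embedding,q)) limit {hits}"
    )
    return {
        "yql": yql,
        "query": query_text,
        "ranking": "colbert_context_level",
        # Assumption: @query makes embed() read the text from the `query` parameter
        "input.query(q)": "embed(e5, @query)",
        "input.query(qt)": "embed(colbert, @query)",
    }


body = build_query_body('say "cheese" sandwich')
print(body["yql"])
print(body["input.query(q)"])
```

The resulting dictionary could then be passed as `session.query(body=body)`, so the query text never has to be escaped inside the `embed(...)` expression.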


Results for the query `chocolate`:

In [20]:
from vespa.io import VespaQueryResponse

with app.syncio(connections=1) as session:
    response:VespaQueryResponse = session.query(
        yql="select * from sources * where ({targetHits:1000}nearestNeighbor(embedding,q))",
        ranking="colbert_context_level",
        query="chocolate", 
        body={
            "input.query(q)": f'embed(e5, "chocolate")',
            "input.query(qt)": f'embed(colbert, "chocolate")',
        },
    )

assert(response.is_successful())
for hit in response.hits:
    record = {}
    for field in ['id', 'title', 'body']:
        record[field] = hit['fields'][field]
    print(record)

{'id': 406331, 'title': 'healthified  decadent hot chocolate', 'body': 'healthified  decadent hot chocolate\nRecipe posted on: 2010-01-02\nTags: weeknight, 15-minutes-or-less, time-to-make, course, main-ingredient, preparation, occasion, healthy, beverages, easy, beginner-cook, fall, low-fat, chocolate, dietary, gifts, low-sodium, low-cholesterol, seasonal, low-calorie, comfort-food, inexpensive, healthy-2, low-in-something, taste-mood, presentation, served-hot, 3-steps-or-less\nDescription: a recipe from tablespoon.com that makes a healthier version of hot chocolate. 50% fewer calories, 77% less fat and 20% more calcium than the original recipe.\nThis recipe takes 15 minutes to be done.\nFor this recipe you will need the ingredients: \nsugar, unsweetened baking cocoa, nonfat milk, fat-free half-and-half, semi-sweet chocolate chips, vanilla\nThe 6 steps to make this recipe are: \nmix together the sugar and cocoa in a 2-quart saucepan, stir in the skim milk and half and half until well 

We can inspect the total and mean relevance of each context, where each key is the index of a line in `body_split`:

- `0`: `recipe['name']`;
- `1`: `"Recipe posted on: " + str(recipe['submitted'])`;
- `2`: `"Tags: " + ', '.join(recipe['tags'])`;
- `3`: `"Description: " + recipe['description']`;
- `4`: `"This recipe takes " + str(recipe['minutes']) + " minutes to be done."`;
- `5`: `"For this recipe you will need the ingredients: "`;
- `6`: `', '.join(recipe['ingredients'])`;
- `7`: `"The " + str(recipe["n_steps"]) + " steps to make this recipe are: "`;
- `8`: `', '.join(recipe['steps'])`.

In [21]:
#response.hits[0]
# get the cells of the first hit
#print(response.hits[0]['fields']['matchfeatures']['max_sim_per_context']['cells'])

total = {'0': 0, '1': 0, '2': 0, '3': 0, '4': 0, '5': 0, '6': 0, '7': 0, '8': 0}

for hit in response.hits:
    cells = hit['fields']['matchfeatures']['max_sim_per_context']['cells']
    for key in total.keys():
        total[key] += cells[key]

# Mean score per context over the returned hits
mean = {k: v / len(response.hits) for k, v in total.items()}

print(total)
print(mean)

{'0': 790.2047920227051, '1': 282.6123237609863, '2': 388.8832656741142, '3': 573.3172216415405, '4': 207.56607818603516, '5': 82.02421569824219, '6': 689.8951778411865, '7': 123.93949127197266, '8': 457.1917953491211}
{'0': 79.02047920227051, '1': 28.261232376098633, '2': 38.888326567411426, '3': 57.33172216415405, '4': 20.756607818603516, '5': 8.202421569824219, '6': 68.98951778411865, '7': 12.393949127197265, '8': 45.71917953491211}
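
The aggregation above assumes every hit has exactly the nine contexts `0` to `8`. As a hedged variant, the sketch below collects whatever context keys are present and ranks the contexts by mean score. The `cells` layout follows the match-features structure used in the cell above, while `mean_score_per_context` and the toy hits are hypothetical.

```python
from collections import defaultdict


def mean_score_per_context(hits):
    """Average max_sim_per_context over hits; a missing context contributes zero."""
    totals = defaultdict(float)
    for hit in hits:
        cells = hit["fields"]["matchfeatures"]["max_sim_per_context"]["cells"]
        for context_id, score in cells.items():
            totals[context_id] += score
    return {context_id: total / len(hits) for context_id, total in totals.items()}


# Toy hits for illustration; real hits come from response.hits
toy_hits = [
    {"fields": {"matchfeatures": {"max_sim_per_context": {
        "cells": {"0": 80.0, "1": 30.0, "2": 40.0}}}}},
    {"fields": {"matchfeatures": {"max_sim_per_context": {
        "cells": {"0": 70.0, "1": 20.0}}}}},
]
means = mean_score_per_context(toy_hits)
ranked = sorted(means, key=means.get, reverse=True)
print(means)
print(ranked)
```

Sorting the means makes it easy to see which lines of the recipe body contribute most to the ColBERT score.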
