# Rerankers

Rerankers have been a common component of retrieval pipelines for many years. They add a final "reranking" step to a pipeline, such as **R**etrieval **A**ugmented **G**eneration (RAG), that reorders the retrieved documents by their relevance to the query and can substantially improve retrieval accuracy.

In the example notebook we'll learn how to create retrieval pipelines with reranking using the [Cohere reranking model](https://txt.cohere.com/rerank/) (which is available for free).

To begin, we set up our prerequisite libraries.

In [None]:
!pip install -qU \
    datasets==2.14.5 \
    openai==1.6.1 \
    pinecone-client==3.1.0 \
    cohere==4.27

# https://www.youtube.com/watch?v=Uh9bYiVrW_s
# RAG But Better: Rerankers with Cohere AI
# James Briggs
# https://docs.cohere.com/docs/chat-on-langchain
# https://www.pinecone.io/learn/refine-with-rerank/

## Data Preparation

We start by downloading a dataset that we will encode and store. The dataset [`jamescalam/ai-arxiv-chunked`](https://huggingface.co/datasets/jamescalam/ai-arxiv-chunked) contains scraped data from many popular ArXiv papers centred around LLMs, including the Llama 2 and GPTQ papers and the GPT-4 technical paper.

In [1]:
from datasets import load_dataset

data = load_dataset("jamescalam/ai-arxiv-chunked", split="train")
data


Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references'],
    num_rows: 41584
})

We have 41,584 chunks, each roughly one to two paragraphs long. Here is an example of a single record:

In [2]:
data[0]

{'doi': '1910.01108',
 'chunk-id': '0',
 'chunk': 'DistilBERT, a distilled version of BERT: smaller,\nfaster, cheaper and lighter\nVictor SANH, Lysandre DEBUT, Julien CHAUMOND, Thomas WOLF\nHugging Face\n{victor,lysandre,julien,thomas}@huggingface.co\nAbstract\nAs Transfer Learning from large-scale pre-trained models becomes more prevalent\nin Natural Language Processing (NLP), operating these large models in on-theedge and/or under constrained computational training or inference budgets remains\nchallenging. In this work, we propose a method to pre-train a smaller generalpurpose language representation model, called DistilBERT, which can then be ﬁnetuned with good performances on a wide range of tasks like its larger counterparts.\nWhile most prior work investigated the use of distillation for building task-speciﬁc\nmodels, we leverage knowledge distillation during the pre-training phase and show\nthat it is possible to reduce the size of a BERT model by 40%, while retaining 97%\nof i

Next we reshape the data into the format we need: each record contains an `id`, the `text` that we will embed, and `metadata`. We don't need metadata for this use case, but including it lets us use metadata filtering later if needed.

In [3]:
data = data.map(lambda x: {
    "id": f'{x["id"]}-{x["chunk-id"]}',
    "text": x["chunk"],
    "metadata": {
        "title": x["title"],
        "url": x["source"],
        "primary_category": x["primary_category"],
        "published": x["published"],
        "updated": x["updated"],
        "text": x["chunk"],
    }
})
# drop unneeded columns
data = data.remove_columns([
    "title", "summary", "source",
    "authors", "categories", "comment",
    "journal_ref", "primary_category",
    "published", "updated", "references",
    "doi", "chunk-id",
    "chunk"
])
data

Dataset({
    features: ['id', 'text', 'metadata'],
    num_rows: 41584
})
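As a minimal sketch of the same mapping applied to one plain dictionary: the helper name `to_record` and the sample row below are hypothetical and only mimic the shape of a dataset row; they are not part of the dataset or of the `datasets` API.

```python
def to_record(row: dict) -> dict:
    """Build the {id, text, metadata} record used in this notebook from one raw row."""
    return {
        # unique ID: paper ID plus the chunk's position within that paper
        "id": f'{row["id"]}-{row["chunk-id"]}',
        "text": row["chunk"],
        "metadata": {
            "title": row["title"],
            "url": row["source"],
            "text": row["chunk"],
        },
    }


# hypothetical sample row, shaped like a row of the dataset
sample = {
    "id": "1910.01108",
    "chunk-id": "0",
    "chunk": "DistilBERT, a distilled version of BERT",
    "title": "Example title",
    "source": "http://example.com/paper",
}

record = to_record(sample)
print(record["id"])
print(sorted(record))
```

Combining the paper ID with the chunk ID keeps every chunk's `id` unique even though many chunks come from the same paper.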

In [4]:
print(data['id'])

['1910.01108-0', '1910.01108-1', '1910.01108-2', '1910.01108-3', '1910.01108-4', '1910.01108-5', '1910.01108-6', '1910.01108-7', '1910.01108-8', '1910.01108-9', '1910.01108-10', '1910.01108-11', '1910.01108-12', '1910.01108-13', '1910.01108-14', '1910.01108-15', '1910.01108-16', '1910.01108-17', '1910.01108-18', '1710.06481-0', '1710.06481-1', '1710.06481-2', '1710.06481-3', '1710.06481-4', '1710.06481-5', '1710.06481-6', '1710.06481-7', '1710.06481-8', '1710.06481-9', '1710.06481-10', '1710.06481-11', '1710.06481-12', '1710.06481-13', '1710.06481-14', '1710.06481-15', '1710.06481-16', '1710.06481-17', '1710.06481-18', '1710.06481-19', '1710.06481-20', '1710.06481-21', '1710.06481-22', '1710.06481-23', '1710.06481-24', '1710.06481-25', '1710.06481-26', '1710.06481-27', '1710.06481-28', '1710.06481-29', '1710.06481-30', '1710.06481-31', '1710.06481-32', '1710.06481-33', '1710.06481-34', '1710.06481-35', '1710.06481-36', '1710.06481-37', '1710.06481-38', '1710.06481-39', '1710.06481-40',

In [5]:
len(data)

41584

In [6]:
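# slicing a Dataset returns a dict of columns, not a list of rows
data_corte=data[:10]
len(data_corte)  # counts the columns: id, text, metadata

3

In [7]:
# iterating over that dict yields the column names
for i, item in enumerate(data_corte):
    print(f'Element {i}: {item}')

Element 0: id
Element 1: text
Element 2: metadata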
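The length of 3 above arises because slicing a `Dataset` returns a dictionary of columns rather than a list of rows. The sketch below mimics that shape with a plain dictionary (the values are made up) and shows one way to turn it back into row-wise records.

```python
# a column-oriented batch, shaped like the result of slicing a Dataset (made-up values)
batch = {
    "id": ["a-0", "a-1"],
    "text": ["first chunk", "second chunk"],
    "metadata": [{"title": "A"}, {"title": "A"}],
}

print(len(batch))   # number of columns, not number of rows
print(list(batch))  # iterating yields the column names

# rebuild row-wise records by zipping the columns together
rows = [dict(zip(batch.keys(), values)) for values in zip(*batch.values())]
print(len(rows))
print(rows[0])
```

Iterating over the dataset itself (as the next cells do) yields one row dictionary at a time, which avoids this conversion.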


In [9]:
data[500]['id']

'2108.10934-101'

In [10]:
from itertools import islice

# take the first 11 records as plain dicts (a small sample keeps embedding costs low)
data_new = [
    {"id": item['id'], "text": item['text'], "metadata": item['metadata']}
    for item in islice(data, 11)
]

print(data_new)

[{'id': '1910.01108-0', 'text': 'DistilBERT, a distilled version of BERT: smaller,\nfaster, cheaper and lighter\nVictor SANH, Lysandre DEBUT, Julien CHAUMOND, Thomas WOLF\nHugging Face\n{victor,lysandre,julien,thomas}@huggingface.co\nAbstract\nAs Transfer Learning from large-scale pre-trained models becomes more prevalent\nin Natural Language Processing (NLP), operating these large models in on-theedge and/or under constrained computational training or inference budgets remains\nchallenging. In this work, we propose a method to pre-train a smaller generalpurpose language representation model, called DistilBERT, which can then be ﬁnetuned with good performances on a wide range of tasks like its larger counterparts.\nWhile most prior work investigated the use of distillation for building task-speciﬁc\nmodels, we leverage knowledge distillation during the pre-training phase and show\nthat it is possible to reduce the size of a BERT model by 40%, while retaining 97%\nof its language unders

In [11]:
data_new[0]

{'id': '1910.01108-0',
 'text': 'DistilBERT, a distilled version of BERT: smaller,\nfaster, cheaper and lighter\nVictor SANH, Lysandre DEBUT, Julien CHAUMOND, Thomas WOLF\nHugging Face\n{victor,lysandre,julien,thomas}@huggingface.co\nAbstract\nAs Transfer Learning from large-scale pre-trained models becomes more prevalent\nin Natural Language Processing (NLP), operating these large models in on-theedge and/or under constrained computational training or inference budgets remains\nchallenging. In this work, we propose a method to pre-train a smaller generalpurpose language representation model, called DistilBERT, which can then be ﬁnetuned with good performances on a wide range of tasks like its larger counterparts.\nWhile most prior work investigated the use of distillation for building task-speciﬁc\nmodels, we leverage knowledge distillation during the pre-training phase and show\nthat it is possible to reduce the size of a BERT model by 40%, while retaining 97%\nof its language unders

In [14]:
len(data_new)

11

We need an embedding model to create our embedding vectors for retrieval. We will use OpenAI's `text-embedding-3-small` (`text-embedding-ada-002` is left commented out in the code as an alternative). There is some cost associated with this model, so be aware of that (costs for running this notebook are <$1).

In [15]:
import os
import openai
import getpass  # only needed for the interactive prompt below
from dotenv import load_dotenv

# option 1: read the key from the environment or prompt for it (platform.openai.com)
#openai.api_key = os.getenv("OPENAI_API_KEY") or getpass.getpass("Enter your OpenAI API key: ")

# option 2 (used here): load the key from a .env file in the parent directory
current_dir = os.getcwd()
parent_dir = os.path.dirname(current_dir)
load_dotenv(os.path.join(parent_dir, ".env"))
openai_api_key = os.getenv("OPENAI_API_KEY")
#embed_model = "text-embedding-ada-002"
embed_model = "text-embedding-3-small"


Now we create our vector DB to store our vectors. For this we need a [free Pinecone API key](https://app.pinecone.io), which can be found under "API Keys" in the left navbar of the Pinecone dashboard.

In [16]:
# pip install pinecone-client

from pinecone import Pinecone

# initialize connection to pinecone (get API key at app.pinecone.io)
#api_key = os.getenv("PINECONE_API_KEY") or getpass.getpass()
api_key = os.getenv("PINECONE_API_KEY")
# configure client
pc = Pinecone(api_key=api_key)


Now we set up our index specification, which defines the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

In [17]:
from pinecone import ServerlessSpec

cloud = os.environ.get('PINECONE_CLOUD') or 'aws'
region = os.environ.get('PINECONE_REGION') or 'us-east-1'
spec = ServerlessSpec(cloud=cloud, region=region)


When creating the index, we set `dimension` equal to the dimensionality of our embedding model (`1536` for `text-embedding-3-small`, the same as Ada-002) and use a compatible `metric` (either `cosine` or `dotproduct`). We also pass our `spec` to index initialization.

In [35]:
import time

index_name = "rerankers"

existing_indexes = pc.list_indexes().names()
print(existing_indexes)

# start from a clean index: delete any previous index with the same name
if index_name in existing_indexes:
    pc.delete_index(index_name)

# (re)create the index; checking the stale `existing_indexes` list here
# would skip creation right after a delete
pc.create_index(
    index_name,
    dimension=1536,  # dimensionality of text-embedding-3-small
    metric='cosine',  # 'dotproduct' also works
    spec=spec
)
# wait for index to be initialized
while not pc.describe_index(index_name).status['ready']:
    time.sleep(1)



['langchain-index', 'my-index-2', 'livro-python', 'my-index']


In [36]:
# connect to index
index = pc.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}
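On the choice between `cosine` and `dotproduct`: for unit-length vectors the two give the same score, because cosine similarity is the dot product divided by the product of the vector norms. OpenAI documents its embeddings as normalized to length 1, which is why either metric is a reasonable choice here. The sketch below checks the identity on two small made-up vectors in plain Python.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(a: list[float]) -> list[float]:
    norm = math.sqrt(dot(a, a))
    return [x / norm for x in a]


# made-up vectors, scaled to unit length
a = normalize([3.0, 4.0])
b = normalize([1.0, 2.0])

# for unit-length vectors, cosine similarity and dot product coincide
print(round(cosine(a, b), 6), round(dot(a, b), 6))
```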

In [22]:
# https://www.pinecone.io/learn/refine-with-rerank/
# simple test

from pinecone import Pinecone

pc = Pinecone(api_key=api_key)

rerank_results = pc.inference.rerank(
    model="bge-reranker-v2-m3",
    query="Tell me about the tech company known as Apple",
    documents=[
      "Apple is a popular fruit known for its sweetness and crisp texture.",	
      "Apple is known for its innovative products like the iPhone.",
      "Many people enjoy eating apples as a healthy snack.",
      "Apple Inc. has revolutionized the tech industry with its sleek designs and user-friendly interfaces.",
      "An apple a day keeps the doctor away, as the saying goes.",
    ],
    top_n=3,
    return_documents=True,
)

In [23]:
print(rerank_results)

RerankResult(
  model='bge-reranker-v2-m3',
  data=[
    { index=1, score=0.48546246,
      document={text="Apple is known fo..."} },
    { index=3, score=0.33014715,
      document={text="Apple Inc. has re..."} },
    { index=0, score=0.008445627,
      document={text="Apple is a popula..."} }
  ],
  usage={'rerank_units': 1}
)
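Each entry's `index` in the printed result refers to the document's position in the input list, so the original texts can be recovered by indexing back into that list. A small sketch using the indices and scores copied from the output above (plain Python, no API call):

```python
documents = [
    "Apple is a popular fruit known for its sweetness and crisp texture.",
    "Apple is known for its innovative products like the iPhone.",
    "Many people enjoy eating apples as a healthy snack.",
    "Apple Inc. has revolutionized the tech industry with its sleek designs and user-friendly interfaces.",
    "An apple a day keeps the doctor away, as the saying goes.",
]

# (index, score) pairs copied from the printed rerank result
ranked = [(1, 0.48546246), (3, 0.33014715), (0, 0.008445627)]

for rank, (idx, score) in enumerate(ranked, start=1):
    print(f"{rank}. score={score:.3f} input position={idx}: {documents[idx][:40]}")

# texts in reranked order
top_texts = [documents[idx] for idx, _ in ranked]
```

Both sentences about the company outrank the fruit-related ones, and the score gap between the second and third result is large.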


In [37]:
# https://www.pinecone.io/learn/refine-with-rerank/
# simple test: embedding a query

query = "What is DistilBERT?"
embeddings_model = "multilingual-e5-large"  # dimension 1024
x = pc.inference.embed(embeddings_model, inputs=[query], parameters={"input_type": "query"})
print(len(x[0]['values']))

from langchain_cohere import CohereEmbeddings
cohere_embeddings = CohereEmbeddings(cohere_api_key=os.getenv("COHERE_API_KEY"), model="embed-english-v3.0")
y = cohere_embeddings.embed_query(query)
print(len(y))

from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small", dimensions=1536, openai_api_key=openai_api_key)
z = embeddings.embed_query(query)
print(len(z))


1024
1024
1536
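The three lengths printed above matter because the index was created with `dimension=1536`: only the 1536-dimensional OpenAI vector can be stored in or queried against it. A guard along the following lines can catch a mismatch before calling the index; the helper name `check_dimension` is hypothetical and not part of any client library.

```python
def check_dimension(vector: list[float], index_dimension: int) -> None:
    """Raise ValueError when a vector's length does not match the index dimension."""
    if len(vector) != index_dimension:
        raise ValueError(
            f"vector has {len(vector)} dimensions, index expects {index_dimension}"
        )


INDEX_DIMENSION = 1536  # dimension used when the index was created

check_dimension([0.0] * 1536, INDEX_DIMENSION)  # matches, passes silently

try:
    check_dimension([0.0] * 1024, INDEX_DIMENSION)  # e.g. a 1024-dimensional vector
except ValueError as err:
    print(err)
```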


In [26]:
print(z)

[-0.005070130806416273, -0.02404046803712845, -0.05696573480963707, 0.002937456825748086, -0.013597607612609863, 0.0010808231309056282, -0.028817657381296158, 0.07339617609977722, -0.06649436056613922, 0.020357782021164894, -0.00015341168909799308, -0.005478960461914539, 0.020473670214414597, -0.041848696768283844, 0.06536123156547546, 0.015245802700519562, 0.014383075758814812, 0.03206254169344902, -0.06464014202356339, 0.06005610153079033, -0.03425154834985733, -0.024710046127438545, 0.007449068129062653, 0.01585099846124649, 0.018233155831694603, 0.03687836229801178, -0.029847780242562294, -0.04380593076348305, 0.016584960743784904, 0.00595861067995429, -0.004831915255635977, -0.04813244193792343, 0.0002776500186882913, -0.05088801681995392, 0.0014775809831917286, 0.04576316103339195, 0.010520119220018387, 0.05158334970474243, -0.0062483325600624084, 0.053823865950107574, -0.06628833711147308, -0.009985743090510368, -0.04975488409399986, 0.00821522157639265, -0.03886134549975395, 0.

In [27]:
z[0]

-0.005070130806416273

In [28]:
len(x[0]['values'])

1024

In [29]:
x[0]['values']

[0.0029659271240234375,
 -0.0194854736328125,
 -0.0279083251953125,
 -0.0277252197265625,
 0.027801513671875,
 -0.0189666748046875,
 -0.019622802734375,
 0.08526611328125,
 0.051361083984375,
 -0.0178375244140625,
 0.022796630859375,
 0.01517486572265625,
 -0.060943603515625,
 -0.037384033203125,
 -0.0305633544921875,
 -0.016998291015625,
 -0.03509521484375,
 0.0162811279296875,
 0.00476837158203125,
 -0.0207061767578125,
 0.0211029052734375,
 0.002544403076171875,
 -0.044647216796875,
 -0.0199127197265625,
 -0.02069091796875,
 -0.016357421875,
 -0.0341796875,
 -0.020111083984375,
 -0.019805908203125,
 -0.0248870849609375,
 -0.006748199462890625,
 0.0084991455078125,
 -0.04779052734375,
 -0.0111541748046875,
 -0.004520416259765625,
 -0.007724761962890625,
 0.01551055908203125,
 0.0236663818359375,
 -0.059844970703125,
 0.0159759521484375,
 -0.0287628173828125,
 0.055328369140625,
 -0.006866455078125,
 -0.024139404296875,
 -0.02655029296875,
 0.0182037353515625,
 -0.00659942626953125,
 

In [30]:
z

[-0.005070130806416273,
 -0.02404046803712845,
 -0.05696573480963707,
 0.002937456825748086,
 -0.013597607612609863,
 0.0010808231309056282,
 -0.028817657381296158,
 0.07339617609977722,
 -0.06649436056613922,
 0.020357782021164894,
 -0.00015341168909799308,
 -0.005478960461914539,
 0.020473670214414597,
 -0.041848696768283844,
 0.06536123156547546,
 0.015245802700519562,
 0.014383075758814812,
 0.03206254169344902,
 -0.06464014202356339,
 0.06005610153079033,
 -0.03425154834985733,
 -0.024710046127438545,
 0.007449068129062653,
 0.01585099846124649,
 0.018233155831694603,
 0.03687836229801178,
 -0.029847780242562294,
 -0.04380593076348305,
 0.016584960743784904,
 0.00595861067995429,
 -0.004831915255635977,
 -0.04813244193792343,
 0.0002776500186882913,
 -0.05088801681995392,
 0.0014775809831917286,
 0.04576316103339195,
 0.010520119220018387,
 0.05158334970474243,
 -0.0062483325600624084,
 0.053823865950107574,
 -0.06628833711147308,
 -0.009985743090510368,
 -0.04975488409399986,
 0.

In [38]:
from pinecone_plugins.inference.core.client.exceptions import PineconeApiException
from langchain_openai import OpenAIEmbeddings

def embed(batch: list[str]) -> list[list[float]]:
    # create embeddings (exponential backoff to avoid rate-limit errors)
    passed = False
    for j in range(5):  # max 5 attempts
        try:
            embed_model = "multilingual-e5-large"
            res = pc.inference.embed(
                model=embed_model,
                inputs=batch,
                parameters={
                    "input_type": "query"
                }
            )
            passed = True
            break  # success, stop retrying
        except PineconeApiException:
            time.sleep(2**j)  # wait 2^j seconds before retrying
            print("Retrying...")
    if not passed:
        raise RuntimeError("Failed to create embeddings.")
    # get embeddings
    embeds = [x["values"] for x in res.data]
    return embeds

def embed2(text: str) -> list[float]:
    # embed a single string with OpenAI's text-embedding-3-small
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small", dimensions=1536, openai_api_key=openai_api_key)
    embeds = embeddings.embed_query(text)
    return embeds

i=0
for chunk in data_new:
    print(f"i={i}")
    i = i + 1
    embedding = embed2(chunk['text'])  # test with embed
    k = len(embedding)
    print(f"{i},{k}: {embedding}")

i=0
1,1536: [-0.016242189332842827, -0.006975580006837845, -0.01976313814520836, -0.01147926039993763, -0.022910289466381073, -0.01044226810336113, -0.00781361386179924, 0.08527450263500214, -0.07471165806055069, 0.046375248581171036, 0.016869207844138145, -0.00917014479637146, -0.041021473705768585, -0.014783165417611599, 0.03776580095291138, 0.006758535280823708, 0.04273371770977974, 0.019992241635918617, -0.02710648812353611, 0.047146961092948914, -0.014457598328590393, -0.014867571182549, 0.008326081559062004, 0.009556001983582973, 0.04044269025325775, 0.0134808961302042, 0.021306568756699562, -0.04213081672787666, 0.040466804057359695, -0.0248878076672554, 0.01076180674135685, -0.04543472081422806, -0.028336409479379654, -0.01818353496491909, 0.012142453342676163, 0.056479889899492264, 0.029301052913069725, 0.02018517069518566, -0.03412427380681038, 0.06554754078388214, -0.01609749160706997, -0.028987543657422066, -0.0451212115585804, 0.03803107887506485, -0.027781739830970764, 0.
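The retry logic in `embed` follows a general pattern, exponential backoff, that can be factored into a reusable helper. The sketch below is one possible shape under stated assumptions: the name `with_backoff` and its parameters are hypothetical, and a stand-in function replaces the real API call so the example runs without credentials.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(
    fn: Callable[[], T],
    retry_on: type[Exception] = Exception,
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, waiting 2**attempt seconds after each failure, up to max_attempts."""
    for attempt in range(max_attempts):
        try:
            return fn()  # success ends the loop immediately
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(2 ** attempt)
    raise RuntimeError("max_attempts must be at least 1")


# stand-in for an API call that fails twice before succeeding
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


waits: list[float] = []  # record the waits instead of actually sleeping
result = with_backoff(flaky, retry_on=ConnectionError, sleep=waits.append)
print(result, waits)
```

Returning as soon as the call succeeds is the important detail: without it, every attempt would run even after a successful response.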

In [None]:
batch_size=1
i=0
i_end = min(len(data), i+batch_size)
print(i_end)
batch = data_new[i:i_end]
print(batch[0]['text'])

In [39]:
from langchain_openai import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings(openai_api_key=openai_api_key, model="text-embedding-3-small")

data_with_embeddings = []
i = 0
for chunk in data_new:
    embedding = embeddings_model.embed_query(chunk['text'])
    data_with_embeddings.append({"text": chunk['text'], "embedding": embedding})
    k = len(embedding)
    print(f"i={i},{k}")
    i = i + 1
    print(chunk['text'])
    

i=0,1536
DistilBERT, a distilled version of BERT: smaller,
faster, cheaper and lighter
Victor SANH, Lysandre DEBUT, Julien CHAUMOND, Thomas WOLF
Hugging Face
{victor,lysandre,julien,thomas}@huggingface.co
Abstract
As Transfer Learning from large-scale pre-trained models becomes more prevalent
in Natural Language Processing (NLP), operating these large models in on-theedge and/or under constrained computational training or inference budgets remains
challenging. In this work, we propose a method to pre-train a smaller generalpurpose language representation model, called DistilBERT, which can then be ﬁnetuned with good performances on a wide range of tasks like its larger counterparts.
While most prior work investigated the use of distillation for building task-speciﬁc
models, we leverage knowledge distillation during the pre-training phase and show
that it is possible to reduce the size of a BERT model by 40%, while retaining 97%
of its language understanding capabilities and being 60% f

In [40]:
from tqdm.auto import tqdm

batch_size = 1  # how many embeddings we create and insert at once

for i in tqdm(range(0, len(data_with_embeddings), batch_size)):
    passed = False
    # find end of batch
    i_end = min(len(data_with_embeddings), i+batch_size)
    # create batch
    batch = data_with_embeddings[i:i_end]
    text_batch = [item["text"] for item in batch]
    ids_batch = [str(n) for n in range(i, i_end)]
    embeds = [item["embedding"] for item in batch]
    # prepare metadata and upsert batch
    # note: zip(text_batch) wraps each text in a 1-tuple and str() then stores the
    # tuple's repr, so stored texts look like "('...',)";
    # [{"text": t} for t in text_batch] would store the raw text instead
    meta = [{"text": text_batch} for text_batch in zip(text_batch)]
    for item in meta:
        if isinstance(item.get('text'), tuple):
            item['text'] = str(item['text'])

    to_upsert = zip(ids_batch, embeds, meta)

    try:
        index.upsert(vectors=list(to_upsert))
    except PineconeApiException as e:
        print(f"API call error: {e}")

    #to_upsert = list(zip(batch["id"], embeds, batch["metadata"]))
    #index.upsert(vectors=to_upsert)

100%|██████████| 11/11 [00:03<00:00,  2.98it/s]
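The batching logic in the loop above can be isolated into a small generator that yields lists of `(id, values, metadata)` tuples, the same shape the loop passes to `index.upsert`. This is a sketch on made-up records with tiny two-dimensional vectors; the name `batched_vectors` is hypothetical.

```python
from typing import Iterator


def batched_vectors(
    records: list[dict], batch_size: int
) -> Iterator[list[tuple[str, list[float], dict]]]:
    """Yield lists of (id, values, metadata) tuples, batch_size records at a time."""
    for start in range(0, len(records), batch_size):
        chunk = records[start:start + batch_size]
        yield [
            # IDs are the record positions as strings, as in the loop above
            (str(start + offset), rec["embedding"], {"text": rec["text"]})
            for offset, rec in enumerate(chunk)
        ]


# made-up records with tiny 2-dimensional "embeddings"
records = [{"text": f"chunk {n}", "embedding": [float(n), 0.0]} for n in range(5)]

batches = list(batched_vectors(records, batch_size=2))
print([len(b) for b in batches])
print(batches[-1])
```

A larger `batch_size` reduces the number of upsert requests, and the last batch is simply shorter when the record count is not a multiple of the batch size.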


In [41]:
len(x[0].values)

1024

In [42]:
len(z)

1536

In [43]:
query_results = index.query(
    vector=z,
    top_k=5,
    include_values=False,
    include_metadata=True
)
print(query_results)

{'matches': [{'id': '0',
              'metadata': {'text': "('DistilBERT, a distilled version of BERT: "
                                   'smaller,\\nfaster, cheaper and '
                                   'lighter\\nVictor SANH, Lysandre DEBUT, '
                                   'Julien CHAUMOND, Thomas WOLF\\nHugging '
                                   'Face\\n{victor,lysandre,julien,thomas}@huggingface.co\\nAbstract\\nAs '
                                   'Transfer Learning from large-scale '
                                   'pre-trained models becomes more '
                                   'prevalent\\nin Natural Language Processing '
                                   '(NLP), operating these large models in '
                                   'on-theedge and/or under constrained '
                                   'computational training or inference '
                                   'budgets remains\\nchallenging. In this '
                                   'w

In [44]:
# transform the matches into the {id, text} format used for reranking
documents = [
    {"id": x["id"], "text": x["metadata"]["text"]} 
    for x in query_results["matches"]
]

In [45]:
documents

[{'id': '0',
  'text': "('DistilBERT, a distilled version of BERT: smaller,\\nfaster, cheaper and lighter\\nVictor SANH, Lysandre DEBUT, Julien CHAUMOND, Thomas WOLF\\nHugging Face\\n{victor,lysandre,julien,thomas}@huggingface.co\\nAbstract\\nAs Transfer Learning from large-scale pre-trained models becomes more prevalent\\nin Natural Language Processing (NLP), operating these large models in on-theedge and/or under constrained computational training or inference budgets remains\\nchallenging. In this work, we propose a method to pre-train a smaller generalpurpose language representation model, called DistilBERT, which can then be ﬁnetuned with good performances on a wide range of tasks like its larger counterparts.\\nWhile most prior work investigated the use of distillation for building task-speciﬁc\\nmodels, we leverage knowledge distillation during the pre-training phase and show\\nthat it is possible to reduce the size of a BERT model by 40%, while retaining 97%\\nof its language u

In [46]:
reranked_documents = pc.inference.rerank(
    model="bge-reranker-v2-m3",
    query=query,
    documents=documents,
    top_n=3,
    return_documents=True,
)

In [47]:
print(reranked_documents)

RerankResult(
  model='bge-reranker-v2-m3',
  data=[
    { index=0, score=0.90087914,
      document={id="0", text="('DistilBERT, a d..."} },
    { index=1, score=0.7466935,
      document={id="5", text="('and teacher hid..."} },
    { index=4, score=0.58207536,
      document={id="8", text="('examples per ba..."} }
  ],
  usage={'rerank_units': 1}
)


Embed the query with the OpenAI embedding function (`embed2`) defined earlier:

In [48]:
query = "What is DistilBERT?"
xq = embed2(query)
print(xq)

[-0.005070130806416273, -0.02404046803712845, -0.05696573480963707, 0.002937456825748086, -0.013597607612609863, 0.0010808231309056282, -0.028817657381296158, 0.07339617609977722, -0.06649436056613922, 0.020357782021164894, -0.00015341168909799308, -0.005478960461914539, 0.020473670214414597, -0.041848696768283844, 0.06536123156547546, 0.015245802700519562, 0.014383075758814812, 0.03206254169344902, -0.06464014202356339, 0.06005610153079033, -0.03425154834985733, -0.024710046127438545, 0.007449068129062653, 0.01585099846124649, 0.018233155831694603, 0.03687836229801178, -0.029847780242562294, -0.04380593076348305, 0.016584960743784904, 0.00595861067995429, -0.004831915255635977, -0.04813244193792343, 0.0002776500186882913, -0.05088801681995392, 0.0014775809831917286, 0.04576316103339195, 0.010520119220018387, 0.05158334970474243, -0.0062483325600624084, 0.053823865950107574, -0.06628833711147308, -0.009985743090510368, -0.04975488409399986, 0.00821522157639265, -0.03886134549975395, 0.

Now let's test retrieval _without_ Cohere's reranking model.

In [49]:
def get_docs(query: str, top_k: int) -> dict[str, int]:
    # encode query
    xq = embed2(query)
    # search pinecone index
    res = index.query(vector=xq, top_k=top_k, include_metadata=True)
    # map each doc text to its rank in the results
    docs = {x["metadata"]['text']: i for i, x in enumerate(res["matches"])}
    return docs

In [50]:
query = "What is DistilBERT?"
docs = get_docs(query, top_k=5)
#print(docs)
print("\n---\n".join(docs.keys()))

('DistilBERT, a distilled version of BERT: smaller,\nfaster, cheaper and lighter\nVictor SANH, Lysandre DEBUT, Julien CHAUMOND, Thomas WOLF\nHugging Face\n{victor,lysandre,julien,thomas}@huggingface.co\nAbstract\nAs Transfer Learning from large-scale pre-trained models becomes more prevalent\nin Natural Language Processing (NLP), operating these large models in on-theedge and/or under constrained computational training or inference budgets remains\nchallenging. In this work, we propose a method to pre-train a smaller generalpurpose language representation model, called DistilBERT, which can then be ﬁnetuned with good performances on a wide range of tasks like its larger counterparts.\nWhile most prior work investigated the use of distillation for building task-speciﬁc\nmodels, we leverage knowledge distillation during the pre-training phase and show\nthat it is possible to reduce the size of a BERT model by 40%, while retaining 97%\nof its language understanding capabilities and being 

Good, but can we get better?

## Reranking Responses

We are more likely to retrieve the documents we need if we return _many_ results, but passing all of them to an LLM doesn't work well. LLM recall [decreases as we add more into the context window](https://www.pinecone.io/blog/why-use-retrieval-instead-of-larger-context/), and we call this excessive filling of the context window _"context stuffing"_.

Fortunately, reranking offers a solution: it finds relevant records that may not be within the top-3 results and pulls them into a smaller set of results to be given to the LLM.
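The two-stage idea can be sketched independently of any provider: retrieve a generous `top_k` set of candidates, rescore each candidate against the query, and keep only the best `top_n`. Everything in the block below is a toy stand-in under stated assumptions: `retrieve_and_rerank`, `toy_retrieve`, and `toy_score` are hypothetical names, and word overlap merely takes the place of a real reranking model.

```python
from typing import Callable


def retrieve_and_rerank(
    query: str,
    retrieve: Callable[[str, int], list[str]],
    score: Callable[[str, str], float],
    top_k: int = 25,
    top_n: int = 3,
) -> list[str]:
    """Fetch top_k candidates, rescore each against the query, keep the best top_n."""
    candidates = retrieve(query, top_k)
    ranked = sorted(candidates, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:top_n]


# toy stand-ins so the sketch runs without any API keys
corpus = [
    "BERT is a transformer encoder.",
    "Llama 2 is a family of open LLMs.",
    "DistilBERT is a distilled version of BERT.",
    "GPTQ is a quantization method.",
]


def toy_retrieve(query: str, top_k: int) -> list[str]:
    return corpus[:top_k]  # pretend this is the first-stage vector search


def toy_score(query: str, doc: str) -> float:
    q = set(query.lower().replace("?", "").split())
    d = set(doc.lower().replace(".", "").split())
    return len(q & d) / len(q)  # fraction of query words found in the doc


best = retrieve_and_rerank("What is DistilBERT?", toy_retrieve, toy_score, top_k=4, top_n=1)
print(best)
```

The relevant document sits third in the first-stage order, yet it is the only one passed on once the candidates are rescored.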

We will use Cohere's rerank endpoint for this. To use it you will need a [Cohere API key](https://dashboard.cohere.com/api-keys). Once you have your key, use it to authenticate your Cohere client like so:

In [51]:
# pip install cohere
import cohere

# os.environ["COHERE_API_KEY"] = os.getenv("COHERE_API_KEY") or getpass.getpass()
# init client
# co = cohere.Client(os.environ["COHERE_API_KEY"])
co = cohere.Client(os.getenv("COHERE_API_KEY"))

In [52]:
docs

{"('DistilBERT, a distilled version of BERT: smaller,\\nfaster, cheaper and lighter\\nVictor SANH, Lysandre DEBUT, Julien CHAUMOND, Thomas WOLF\\nHugging Face\\n{victor,lysandre,julien,thomas}@huggingface.co\\nAbstract\\nAs Transfer Learning from large-scale pre-trained models becomes more prevalent\\nin Natural Language Processing (NLP), operating these large models in on-theedge and/or under constrained computational training or inference budgets remains\\nchallenging. In this work, we propose a method to pre-train a smaller generalpurpose language representation model, called DistilBERT, which can then be ﬁnetuned with good performances on a wide range of tasks like its larger counterparts.\\nWhile most prior work investigated the use of distillation for building task-speciﬁc\\nmodels, we leverage knowledge distillation during the pre-training phase and show\\nthat it is possible to reduce the size of a BERT model by 40%, while retaining 97%\\nof its language understanding capabilit

In [53]:
len(docs)

5

In [54]:
docs.keys()

dict_keys(["('DistilBERT, a distilled version of BERT: smaller,\\nfaster, cheaper and lighter\\nVictor SANH, Lysandre DEBUT, Julien CHAUMOND, Thomas WOLF\\nHugging Face\\n{victor,lysandre,julien,thomas}@huggingface.co\\nAbstract\\nAs Transfer Learning from large-scale pre-trained models becomes more prevalent\\nin Natural Language Processing (NLP), operating these large models in on-theedge and/or under constrained computational training or inference budgets remains\\nchallenging. In this work, we propose a method to pre-train a smaller generalpurpose language representation model, called DistilBERT, which can then be ﬁnetuned with good performances on a wide range of tasks like its larger counterparts.\\nWhile most prior work investigated the use of distillation for building task-speciﬁc\\nmodels, we leverage knowledge distillation during the pre-training phase and show\\nthat it is possible to reduce the size of a BERT model by 40%, while retaining 97%\\nof its language understanding

In [55]:
keys_list = list(docs.keys())
keys_list

["('DistilBERT, a distilled version of BERT: smaller,\\nfaster, cheaper and lighter\\nVictor SANH, Lysandre DEBUT, Julien CHAUMOND, Thomas WOLF\\nHugging Face\\n{victor,lysandre,julien,thomas}@huggingface.co\\nAbstract\\nAs Transfer Learning from large-scale pre-trained models becomes more prevalent\\nin Natural Language Processing (NLP), operating these large models in on-theedge and/or under constrained computational training or inference budgets remains\\nchallenging. In this work, we propose a method to pre-train a smaller generalpurpose language representation model, called DistilBERT, which can then be ﬁnetuned with good performances on a wide range of tasks like its larger counterparts.\\nWhile most prior work investigated the use of distillation for building task-speciﬁc\\nmodels, we leverage knowledge distillation during the pre-training phase and show\\nthat it is possible to reduce the size of a BERT model by 40%, while retaining 97%\\nof its language understanding capabilit

Now we can rerank our results with `co.rerank`. Let's try it with our earlier results.

In [58]:
# https://docs.cohere.com/reference/rerank

rerank_docs = co.rerank(
    query=query, documents=keys_list, top_n=4, model="rerank-english-v2.0", return_documents=True
)

In [59]:
rerank_docs



This returns a `RerankResponse` object:

In [60]:
type(rerank_docs)

cohere.types.rerank_response.RerankResponse

The reranked items, including the text content of each doc, are available in `results`:

In [61]:
rerank_docs.results

[RerankResponseResultsItem(document=RerankResponseResultsItemDocument(text="('DistilBERT, a distilled version of BERT: smaller,\\nfaster, cheaper and lighter\\nVictor SANH, Lysandre DEBUT, Julien CHAUMOND, Thomas WOLF\\nHugging Face\\n{victor,lysandre,julien,thomas}@huggingface.co\\nAbstract\\nAs Transfer Learning from large-scale pre-trained models becomes more prevalent\\nin Natural Language Processing (NLP), operating these large models in on-theedge and/or under constrained computational training or inference budgets remains\\nchallenging. In this work, we propose a method to pre-train a smaller generalpurpose language representation model, called DistilBERT, which can then be ﬁnetuned with good performances on a wide range of tasks like its larger counterparts.\\nWhile most prior work investigated the use of distillation for building task-speciﬁc\\nmodels, we leverage knowledge distillation during the pre-training phase and show\\nthat it is possible to reduce the size of a BERT m

The reordered results look like so (each `index` is the document's position in the original list):

In [62]:
[doc.index for doc in rerank_docs.results]

[0, 1, 2, 4]

Let's write a function to allow us to more easily compare the original results vs. reranked results.

In [63]:
docs = get_docs(query, top_k=5)
print(docs)

{"('DistilBERT, a distilled version of BERT: smaller,\\nfaster, cheaper and lighter\\nVictor SANH, Lysandre DEBUT, Julien CHAUMOND, Thomas WOLF\\nHugging Face\\n{victor,lysandre,julien,thomas}@huggingface.co\\nAbstract\\nAs Transfer Learning from large-scale pre-trained models becomes more prevalent\\nin Natural Language Processing (NLP), operating these large models in on-theedge and/or under constrained computational training or inference budgets remains\\nchallenging. In this work, we propose a method to pre-train a smaller generalpurpose language representation model, called DistilBERT, which can then be ﬁnetuned with good performances on a wide range of tasks like its larger counterparts.\\nWhile most prior work investigated the use of distillation for building task-speciﬁc\\nmodels, we leverage knowledge distillation during the pre-training phase and show\\nthat it is possible to reduce the size of a BERT model by 40%, while retaining 97%\\nof its language understanding capabilit

In [64]:
[docs.values()]

[dict_values([0, 1, 2, 3, 4])]

In [65]:
i2doc = {docs[doc]: doc for doc in docs.keys()}
print(i2doc)

{0: "('DistilBERT, a distilled version of BERT: smaller,\\nfaster, cheaper and lighter\\nVictor SANH, Lysandre DEBUT, Julien CHAUMOND, Thomas WOLF\\nHugging Face\\n{victor,lysandre,julien,thomas}@huggingface.co\\nAbstract\\nAs Transfer Learning from large-scale pre-trained models becomes more prevalent\\nin Natural Language Processing (NLP), operating these large models in on-theedge and/or under constrained computational training or inference budgets remains\\nchallenging. In this work, we propose a method to pre-train a smaller generalpurpose language representation model, called DistilBERT, which can then be ﬁnetuned with good performances on a wide range of tasks like its larger counterparts.\\nWhile most prior work investigated the use of distillation for building task-speciﬁc\\nmodels, we leverage knowledge distillation during the pre-training phase and show\\nthat it is possible to reduce the size of a BERT model by 40%, while retaining 97%\\nof its language understanding capabi

In [66]:
i2doc[0]

"('DistilBERT, a distilled version of BERT: smaller,\\nfaster, cheaper and lighter\\nVictor SANH, Lysandre DEBUT, Julien CHAUMOND, Thomas WOLF\\nHugging Face\\n{victor,lysandre,julien,thomas}@huggingface.co\\nAbstract\\nAs Transfer Learning from large-scale pre-trained models becomes more prevalent\\nin Natural Language Processing (NLP), operating these large models in on-theedge and/or under constrained computational training or inference budgets remains\\nchallenging. In this work, we propose a method to pre-train a smaller generalpurpose language representation model, called DistilBERT, which can then be ﬁnetuned with good performances on a wide range of tasks like its larger counterparts.\\nWhile most prior work investigated the use of distillation for building task-speciﬁc\\nmodels, we leverage knowledge distillation during the pre-training phase and show\\nthat it is possible to reduce the size of a BERT model by 40%, while retaining 97%\\nof its language understanding capabiliti

In [67]:
docs

{"('DistilBERT, a distilled version of BERT: smaller,\\nfaster, cheaper and lighter\\nVictor SANH, Lysandre DEBUT, Julien CHAUMOND, Thomas WOLF\\nHugging Face\\n{victor,lysandre,julien,thomas}@huggingface.co\\nAbstract\\nAs Transfer Learning from large-scale pre-trained models becomes more prevalent\\nin Natural Language Processing (NLP), operating these large models in on-theedge and/or under constrained computational training or inference budgets remains\\nchallenging. In this work, we propose a method to pre-train a smaller generalpurpose language representation model, called DistilBERT, which can then be ﬁnetuned with good performances on a wide range of tasks like its larger counterparts.\\nWhile most prior work investigated the use of distillation for building task-speciﬁc\\nmodels, we leverage knowledge distillation during the pre-training phase and show\\nthat it is possible to reduce the size of a BERT model by 40%, while retaining 97%\\nof its language understanding capabilit

In [73]:
texto = list(docs.keys())[4]
print(texto)

('examples per batch) using dynamic masking and without the next sentence prediction objective.\nData and compute power We train DistilBERT on the same corpus as the original BERT model:\na concatenation of English Wikipedia and Toronto Book Corpus [Zhu et al., 2015]. DistilBERT\nwas trained on 8 16GB V100 GPUs for approximately 90 hours. For the sake of comparison, the\nRoBERTa model [Liu et al., 2019] required 1 day of training on 1024 32GB V100.\n4 Experiments\nGeneral Language Understanding We assess the language understanding and generalization capabilities of DistilBERT on the General Language Understanding Evaluation (GLUE) benchmark\n[Wang et al., 2018], a collection of 9 datasets for evaluating natural language understanding systems.\nWe report scores on the development sets for each task by ﬁne-tuning DistilBERT without the use\nof ensembling or multi-tasking scheme for ﬁne-tuning (which are mostly orthogonal to the present\nwork). We compare the results to the baseline provi

In [74]:
keys_list = list(docs.keys())
rerank_docs = co.rerank(
    query=query,
    documents=keys_list,
    top_n=4,
    model="rerank-english-v2.0",
    return_documents=True,
)
print(rerank_docs)



In [75]:
[docs.values()]

[dict_values([0, 1, 2, 3, 4])]

In [None]:
[doc.index for doc in rerank_docs.results]

In [None]:
i = 3  # position in the reranked results
x = rerank_docs.results[i]
print(f"Rerank [{i}]: ", x.document.text)
print('------------------')
print(f"Original [{i}]: ", list(docs.keys())[i])
print('------------------')
# position of this document in the original (pre-rerank) list
print(x.index)

In [76]:
def compare(query: str, top_k: int, top_n: int):
    # first get vec search results (maps document text -> original position)
    docs = get_docs(query, top_k=top_k)
    keys_list = list(docs.keys())
    # rerank
    rerank_docs = co.rerank(
        query=query, documents=keys_list, top_n=top_n, model="rerank-english-v2.0", return_documents=True
    )
    # debug: print the raw response fields (0: id, 1: results, 2: metadata)
    for i, field in enumerate(rerank_docs):
        print(f"{i}, {field} \n\n")
    original_docs = []
    reranked_docs = []
    # compare order change: what sat at position k before vs. after reranking
    for k, result in enumerate(rerank_docs.results):
        # result.index is the document's position in keys_list
        reranked_docs.append(f"[{result.index}]\n" + result.document.text)
        # label the original document with its own position, k
        original_docs.append(f"[{k}]\n" + keys_list[k])
    for orig, rerank in zip(original_docs, reranked_docs):
        print("ORIGINAL:\n"+orig+"\n\nRERANKED:\n"+rerank+"\n\n---\n")
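Reranking only matters where it changes the order, so it can be handy to list just the positions that moved. The sketch below is a hypothetical helper (`rank_changes` is not part of the notebook or the Cohere SDK) that takes the `index` values of the reranked results, in reranked order, and returns the pairs whose position differs.

```python
def rank_changes(reranked_indices):
    """Return (new_position, original_position) pairs for results that moved."""
    return [
        (new_pos, orig_pos)
        for new_pos, orig_pos in enumerate(reranked_indices)
        if new_pos != orig_pos
    ]


# The order seen earlier in this notebook: only the last slot changed.
print(rank_changes([0, 1, 2, 4]))
# A made-up order in which every result moved.
print(rank_changes([3, 0, 1]))
```

In practice the input could be built with `[doc.index for doc in rerank_docs.results]`, as in the earlier cell.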

In [77]:
compare(query, top_k=5, top_n=4)

0, ('id', '97000b6b-1f46-415d-ad97-f5966c0a392d') 


1, ('results', [RerankResponseResultsItem(document=RerankResponseResultsItemDocument(text="('DistilBERT, a distilled version of BERT: smaller,\\nfaster, cheaper and lighter\\nVictor SANH, Lysandre DEBUT, Julien CHAUMOND, Thomas WOLF\\nHugging Face\\n{victor,lysandre,julien,thomas}@huggingface.co\\nAbstract\\nAs Transfer Learning from large-scale pre-trained models becomes more prevalent\\nin Natural Language Processing (NLP), operating these large models in on-theedge and/or under constrained computational training or inference budgets remains\\nchallenging. In this work, we propose a method to pre-train a smaller generalpurpose language representation model, called DistilBERT, which can then be ﬁnetuned with good performances on a wide range of tasks like its larger counterparts.\\nWhile most prior work investigated the use of distillation for building task-speciﬁc\\nmodels, we leverage knowledge distillation during the pre-training

Don't forget to delete your index when you're done to save resources!

In [None]:
pc.delete_index(index_name)
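If this cell might be run more than once, a guarded delete avoids an error when the index is already gone. This is a minimal sketch, assuming the `pinecone-client` 3.x interface in which `pc.list_indexes().names()` returns the existing index names; `delete_index_if_exists` is a hypothetical helper name, not a Pinecone function.

```python
def delete_index_if_exists(pc, index_name):
    """Delete the index only if it currently exists; return True if it was deleted."""
    # Assumption: pc is a pinecone.Pinecone client (pinecone-client 3.x).
    existing = pc.list_indexes().names()
    if index_name in existing:
        pc.delete_index(index_name)
        return True
    return False
```

It would be called as `delete_index_if_exists(pc, index_name)` in place of the bare `pc.delete_index(index_name)`.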

---