# Encoder as a retriever

Sometimes a user's query does not match any document, especially in a small corpus. This is where neural search becomes useful: the encoder can act as a fallback and find documents when traditional retrievers return nothing.

In [1]:
from cherche import retrieve, rank, data
from sentence_transformers import SentenceTransformer

Let's load a dummy dataset.

In [2]:
documents = data.load_towns()
documents[:2]

[{'id': 0,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'Paris (French pronunciation: \u200b[paʁi] (listen)) is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018, in an area of more than 105 square kilometres (41 square miles).'},
 {'id': 1,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': "Since the 17th century, Paris has been one of Europe's major centres of finance, diplomacy, commerce, fashion, gastronomy, science, and arts."}]

First, we will run a search with TfIdf to show that a lexical model's ability to retrieve documents can be limited.

In [3]:
retriever = retrieve.TfIdf(key="id", on=["article", "title"], documents=documents, k=10)
retriever

TfIdf retriever
 	 key: id
 	 on: article, title
 	 documents: 105

Only a single document matches the query "food" with the default TfIdf.

In [4]:
retriever("food")

[{'id': 96, 'similarity': 0.22241083884569526}]
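
The single hit follows from how lexical retrievers score documents: a document only gets a non-zero score when it shares a token with the query. The block below is a minimal sketch of that behaviour in plain Python. It assumes a simplified TF-IDF formula (not Cherche's implementation), and `toy_documents`, `tokenize` and `tfidf_scores` are hypothetical names introduced for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not the towns dataset used in this notebook.
toy_documents = {
    "a": "montreal is a centre of food and fashion",
    "b": "lyon is recognised for its cuisine and gastronomy",
    "c": "bordeaux is a centre of gastronomy",
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_scores(query, documents):
    """Score each document by summing tf * idf over the query tokens it contains."""
    tokenized = {key: tokenize(text) for key, text in documents.items()}
    n_documents = len(tokenized)
    # Number of documents in which each token appears.
    document_frequency = Counter(
        token for tokens in tokenized.values() for token in set(tokens)
    )
    scores = {}
    for key, tokens in tokenized.items():
        term_frequency = Counter(tokens)
        score = 0.0
        for token in tokenize(query):
            if token in term_frequency:
                tf = term_frequency[token] / len(tokens)
                idf = math.log(1 + n_documents / document_frequency[token])
                score += tf * idf
        scores[key] = score
    return scores


scores = tfidf_scores("food", toy_documents)
matches = [key for key, score in scores.items() if score > 0]
print(matches)  # ['a']
```

Documents "b" and "c" talk about gastronomy and cuisine, yet they score zero because neither contains the literal token "food".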

We can now compare these results with `retrieve.Encoder` using Sentence-BERT. The `add` method takes time because the retriever computes an embedding for every document. Once this is done, it saves the embeddings in the `all-mpnet-base-v2.pkl` file, so they will not be computed twice.

In [5]:
retriever = retrieve.Encoder(
    key="id",
    on=["title", "article"],
    encoder=SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode,
    k=5,
)

retriever.add(documents=documents)

Embeddings calculation.: 100%|█| 2/2 [00:02<00:00,  1.


Encoder retriever
 	 key: id
 	 on: title, article
 	 documents: 105
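
The "compute once, then reuse" behaviour described for `add` can be pictured with a small cache on disk. This is only a sketch under stated assumptions: `toy_encoder`, `embed_with_cache` and the file name `toy-embeddings.pkl` are hypothetical, the cache is keyed by document id for simplicity, and none of this claims to mirror how Cherche organises its own pickle file.

```python
import os
import pickle
import tempfile

calls = []  # records every text that was actually encoded


def toy_encoder(text):
    """Hypothetical stand-in for a real model: returns a tiny fake embedding."""
    calls.append(text)
    return [float(len(text)), float(text.count(" "))]


def embed_with_cache(documents, path):
    """Encode only documents missing from the pickle cache, then save the cache."""
    cache = {}
    if os.path.exists(path):
        with open(path, "rb") as cache_file:
            cache = pickle.load(cache_file)
    for document in documents:
        if document["id"] not in cache:
            cache[document["id"]] = toy_encoder(document["article"])
    with open(path, "wb") as cache_file:
        pickle.dump(cache, cache_file)
    return cache


toy_documents = [
    {"id": 0, "article": "Paris is the capital of France"},
    {"id": 1, "article": "Lyon is known for gastronomy"},
]

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "toy-embeddings.pkl")
    first = embed_with_cache(toy_documents, path)   # encodes both documents
    second = embed_with_cache(toy_documents, path)  # served from the cache

print(len(calls))  # 2
```

The second call returns the same embeddings without invoking the encoder again, which is why only the first `add` is slow.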

As can be seen, the encoder recalls more documents, even though they do not all contain the word "food". These documents seem relevant.

In [6]:
retriever("food")

[{'id': 48, 'similarity': 0.6018152295710627},
 {'id': 66, 'similarity': 0.5962205569027038},
 {'id': 96, 'similarity': 0.5876264149330912},
 {'id': 16, 'similarity': 0.5827899889105632},
 {'id': 49, 'similarity': 0.561209893678236}]
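
Each score above is a similarity between the query embedding and a document embedding. The following sketch shows that ranking step in isolation. It assumes cosine similarity and uses hypothetical 3-dimensional vectors in place of real model embeddings; `cosine`, `top_k`, `toy_embeddings` and `query_embedding` are illustrative names, not part of Cherche.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings: a real encoder would produce hundreds of dimensions.
toy_embeddings = {
    "gastronomy": [0.9, 0.1, 0.0],
    "finance": [0.0, 0.2, 0.9],
    "cuisine": [0.8, 0.3, 0.1],
}
query_embedding = [1.0, 0.2, 0.0]  # stands in for the embedding of "food"


def top_k(query, embeddings, k):
    """Return the k entries most similar to the query, best first."""
    ranked = sorted(
        (
            {"id": key, "similarity": cosine(query, vector)}
            for key, vector in embeddings.items()
        ),
        key=lambda result: result["similarity"],
        reverse=True,
    )
    return ranked[:k]


results = top_k(query_embedding, toy_embeddings, k=2)
print([result["id"] for result in results])  # ['gastronomy', 'cuisine']
```

No token is shared between the query and the documents here; closeness in the embedding space alone decides the ranking.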

In [7]:
(retriever + documents)("food")

[{'id': 48,
  'title': 'Lyon',
  'url': 'https://en.wikipedia.org/wiki/Lyon',
  'article': "The city is recognised for its cuisine and gastronomy, as well as historical and architectural landmarks; as such, the districts of Old Lyon, the Fourvière hill, the Presqu'île and the slopes of the Croix-Rousse are inscribed on the UNESCO World Heritage List.",
  'similarity': 0.6018152295710627},
 {'id': 66,
  'title': 'Bordeaux',
  'url': 'https://en.wikipedia.org/wiki/Bordeaux',
  'article': 'Bordeaux is also one of the centers of gastronomy and business tourism for the organization of international congresses.',
  'similarity': 0.5962205569027038},
 {'id': 96,
  'title': 'Montreal',
  'url': 'https://en.wikipedia.org/wiki/Montreal',
  'article': 'It remains an important centre of commerce, aerospace, transport, finance, pharmaceuticals, technology, design, education, art, culture, tourism, food, fashion, video game development, film, and world affairs.',
  'similarity': 0.5876264149330912},

We can build a neural search pipeline that benefits from both TfIdf precision and Sentence Transformers recall using the union operator `|`.

In [8]:
encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode

In [9]:
# Precision pipeline
precision = retrieve.TfIdf(
    key="id", on=["article", "title"], documents=documents, k=30
) + rank.Encoder(key="id", on=["title", "article"], encoder=encoder, k=5)

# Recall pipeline
recall = retrieve.Encoder(key="id", on=["title", "article"], encoder=encoder, k=5)

search = precision | recall

search.add(documents=documents)

Embeddings calculation.: 100%|█| 2/2 [00:02<00:00,  1.


Union Pipeline
-----
TfIdf retriever
 	 key: id
 	 on: article, title
 	 documents: 105
Encoder ranker
	 key: id
	 on: title, article
	 k: 5
	 similarity: cosine
Encoder retriever
 	 key: id
 	 on: title, article
 	 documents: 105
-----

Our pipeline first proposes documents from the `precision` pipeline, then documents from the `recall` pipeline.

In [10]:
search("food")

[{'id': 96, 'similarity': 0.600159740267104},
 {'id': 48, 'similarity': 0.20318203662619722},
 {'id': 66, 'similarity': 0.20204847355309763},
 {'id': 16, 'similarity': 0.19935298925146186},
 {'id': 49, 'similarity': 0.19509702003503537}]
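
The ordering above, precision results first and recall results after, can be sketched as an ordered merge that keeps the first occurrence of each id. This is a minimal illustration, not Cherche's implementation: the `union` function is hypothetical, the deduplication rule is an assumption consistent with id 96 appearing only once, and the sketch ignores similarity values (the pipeline output shows they differ from the raw encoder scores).

```python
def union(*result_lists):
    """Concatenate result lists in order, keeping the first occurrence of each id."""
    seen = set()
    merged = []
    for results in result_lists:
        for document in results:
            if document["id"] not in seen:
                seen.add(document["id"])
                merged.append(document)
    return merged


# Ids taken from the outputs shown earlier in this notebook.
precision_results = [{"id": 96}]
recall_results = [{"id": 48}, {"id": 66}, {"id": 96}, {"id": 16}, {"id": 49}]

merged = union(precision_results, recall_results)
print([document["id"] for document in merged])  # [96, 48, 66, 16, 49]
```

The merged order matches the ids returned by `search("food")`: the lexical match comes first, followed by the encoder's remaining candidates.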

In [11]:
search += documents
search("food")

[{'id': 96,
  'title': 'Montreal',
  'url': 'https://en.wikipedia.org/wiki/Montreal',
  'article': 'It remains an important centre of commerce, aerospace, transport, finance, pharmaceuticals, technology, design, education, art, culture, tourism, food, fashion, video game development, film, and world affairs.',
  'similarity': 0.600159740267104},
 {'id': 48,
  'title': 'Lyon',
  'url': 'https://en.wikipedia.org/wiki/Lyon',
  'article': "The city is recognised for its cuisine and gastronomy, as well as historical and architectural landmarks; as such, the districts of Old Lyon, the Fourvière hill, the Presqu'île and the slopes of the Croix-Rousse are inscribed on the UNESCO World Heritage List.",
  'similarity': 0.20318203662619722},
 {'id': 66,
  'title': 'Bordeaux',
  'url': 'https://en.wikipedia.org/wiki/Bordeaux',
  'article': 'Bordeaux is also one of the centers of gastronomy and business tourism for the organization of international congresses.',
  'similarity': 0.20204847355309763}