# Retriever and ranker

This notebook presents a simple neural search pipeline composed of a retriever and a ranker.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from cherche import data, rank, retrieve
from sentence_transformers import SentenceTransformer

The first step is to define the corpus on which we will perform the neural search. The towns dataset contains about a hundred documents. Each document has four attributes: the `id`, the `title` of the article, the `url`, and the content of the `article`.

In [3]:
documents = data.load_towns()
documents[:4]

[{'id': 0,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'Paris (French pronunciation: \u200b[paʁi] (listen)) is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018, in an area of more than 105 square kilometres (41 square miles).'},
 {'id': 1,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': "Since the 17th century, Paris has been one of Europe's major centres of finance, diplomacy, commerce, fashion, gastronomy, science, and arts."},
 {'id': 2,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'The City of Paris is the centre and seat of government of the region and province of Île-de-France, or Paris Region, which has an estimated population of 12,174,880, or about 18 percent of the population of France as of 2017.'},
 {'id': 3,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'The Paris Region had 

We start by creating a retriever whose job is to quickly filter the documents. The `on` parameter tells it to match documents on both the title and the content of the article.

In [4]:
retriever = retrieve.TfIdf(key="id", on=["title", "article"], documents=documents, k=30)

We then add a ranker to the pipeline to filter the results according to the semantic similarity between the query and the retriever's output documents. Like the retriever, the ranker is based on the title and the content of the article, and it keeps the 3 most similar documents (`k=3`).

In [5]:
ranker = rank.Encoder(
    key="id",
    on=["title", "article"],
    encoder=SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode,
    k=3,
)

We initialise the pipeline, then ask the retriever to index the documents and the ranker to pre-compute the document embeddings. This step can take some time if you have many documents, in which case a GPU can speed up the embedding computation. The embeddings will be stored in the `encoder.pkl` file.

In [6]:
search = retriever + ranker
search.add(documents)

Ranker embeddings calculation.: 100%|█| 2/2 [00:04<00:


TfIdf retriever
 	 key: id
 	 on: title, article
 	 documents: 105
Encoder ranker
	 key: id
	 on: title, article
	 k: 3
	 similarity: cosine
	 Embeddings pre-computed: 105

Let's call our model to retrieve documents related to football in Paris. The search pipeline provides a similarity score for each document. The documents are sorted in order of relevance, from most similar to least similar.

In [7]:
search("paris football")

[{'id': 20, 'similarity': 0.7220986485481262},
 {'id': 24, 'similarity': 0.5216039419174194},
 {'id': 16, 'similarity': 0.484182745218277}]
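
The ranking step above can be pictured with a minimal sketch. It is not cherche's implementation: the helper names `cosine` and `rank_by_cosine` are hypothetical, and the two-dimensional toy vectors stand in for the embeddings that the `SentenceTransformer` encoder would produce. It only illustrates scoring candidates by cosine similarity and keeping the top `k`, sorted from most to least similar.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_cosine(query_vec, candidates, k):
    """Score every candidate against the query and keep the k most similar.

    `candidates` maps a document id to its (toy) embedding.
    """
    scored = [
        {"id": doc_id, "similarity": cosine(query_vec, vec)}
        for doc_id, vec in candidates.items()
    ]
    scored.sort(key=lambda item: item["similarity"], reverse=True)
    return scored[:k]


# Toy embeddings, purely illustrative.
query_vec = [1.0, 0.0]
candidates = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
top = rank_by_cosine(query_vec, candidates, k=2)
print(top)
```

In the real pipeline the candidates are the documents returned by the retriever, which is why a fast retriever with good recall matters: the ranker can only re-order what it receives.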

The retriever we use is a bit too basic: the word `aerospace` appears in the corpus but `aero` does not, so we are unable to retrieve relevant documents for the query `aero`.

In [8]:
search("aero")  # Aerospace

[]

We can improve retrieval by indexing character n-grams instead of whole words, using the `ngram_range` and `analyzer` parameters of the `TfidfVectorizer` model. This change will reduce the retriever's precision but increase its recall.

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

retriever = retrieve.TfIdf(
    key="id",
    on=["title", "article"],
    documents=documents,
    k=30,
    tfidf=TfidfVectorizer(ngram_range=(4, 10), analyzer="char_wb", max_df=0.3),
)

search = retriever + ranker
search.add(documents)

Ranker embeddings calculation.: 100%|█| 2/2 [00:04<00:


TfIdf retriever
 	 key: id
 	 on: title, article
 	 documents: 105
Encoder ranker
	 key: id
	 on: title, article
	 k: 3
	 similarity: cosine
	 Embeddings pre-computed: 105

In [10]:
search("paris football")

[{'id': 20, 'similarity': 0.7220986485481262},
 {'id': 24, 'similarity': 0.5216039419174194},
 {'id': 16, 'similarity': 0.484182745218277}]

By working at the character level, we have built a retriever with better recall.

In [11]:
search("aero")  # Aerospace

[{'id': 67, 'similarity': 0.32282117009162903},
 {'id': 29, 'similarity': 0.30668121576309204},
 {'id': 31, 'similarity': 0.2690589427947998}]

Let's map the returned ids back to the full documents.

In [12]:
search += documents

In [13]:
search("paris football")

[{'id': 20,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'The football club Paris Saint-Germain and the rugby union club Stade Français are based in Paris.',
  'similarity': 0.7220986485481262},
 {'id': 24,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'The 1938 and 1998 FIFA World Cups, the 2007 Rugby World Cup, as well as the 1960, 1984 and 2016 UEFA European Championships were also held in the city.',
  'similarity': 0.5216039419174194},
 {'id': 16,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'Paris received 12.',
  'similarity': 0.484182745218277}]

In [14]:
search("aero")  # Aerospace

[{'id': 67,
  'title': 'Bordeaux',
  'url': 'https://en.wikipedia.org/wiki/Bordeaux',
  'article': 'It is a central and strategic hub for the aeronautics, military and space sector, home to international companies such as Dassault Aviation, Ariane Group, Safran and Thalès.',
  'similarity': 0.32282117009162903},
 {'id': 29,
  'title': 'Toulouse',
  'url': 'https://en.wikipedia.org/wiki/Toulouse',
  'article': 'Toulouse is the centre of the European aerospace industry, with the headquarters of Airbus (formerly EADS), the SPOT satellite system, ATR and the Aerospace Valley.',
  'similarity': 0.30668121576309204},
 {'id': 31,
  'title': 'Toulouse',
  'url': 'https://en.wikipedia.org/wiki/Toulouse',
  'article': 'Thales Alenia Space, ATR, SAFRAN, Liebherr-Aerospace and Airbus Defence and Space also have a significant presence in Toulouse.',
  'similarity': 0.2690589427947998}]
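
Conceptually, mapping ids back to documents amounts to a lookup by key. The sketch below is an assumption about the idea, not cherche's actual code: `attach_documents` is a hypothetical helper, and the toy documents and scores are made up for the example.

```python
def attach_documents(results, documents, key="id"):
    """Merge each ranked result with the full document that shares its key.

    The order of `results` (most to least similar) is preserved.
    """
    by_key = {doc[key]: doc for doc in documents}
    return [{**by_key[r[key]], **r} for r in results if r[key] in by_key]


# Toy data, purely illustrative.
toy_documents = [
    {"id": 0, "title": "Paris"},
    {"id": 1, "title": "Toulouse"},
]
toy_results = [
    {"id": 1, "similarity": 0.9},
    {"id": 0, "similarity": 0.4},
]

enriched = attach_documents(toy_results, toy_documents)
print(enriched)
```

Keeping the pipeline output as bare ids and scores until this last step means the retriever and ranker never need to carry the full document payload around.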