# News Articles Recommendations with Embeddings and a Vector Database

This notebook demonstrates how to combine [Sentence Transformers](https://www.sbert.net/) and [Qdrant](https://qdrant.tech/) to make news article recommendations based on the [NPR dataset](https://www.kaggle.com/datasets/joelpl/news-portal-recommendations-npr-by-globo).


Let's start by importing the necessary libraries and defining key constants:

- **METADATA_FILEPATH**: a path to the `articles.parquet` file from the NPR dataset
- **EMBEDDINGS_FILEPATH**: a path to a parquet file with pre-computed embeddings; if the file exists, the embeddings are loaded from it instead of being recomputed
- **MODEL_NAME**: the embedding model to be used with Sentence Transformers

In [158]:
import os
import numpy as np
import pandas as pd
from tqdm import tqdm
from qdrant_client import models, QdrantClient
from sentence_transformers import SentenceTransformer

METADATA_FILEPATH = "./../data/articles/articles.parquet"
EMBEDDINGS_FILEPATH = "./../data/articles/articles_with_embeddings.parquet"
MODEL_NAME = "neuralmind/bert-base-portuguese-cased"

In [159]:
df = pd.read_parquet(METADATA_FILEPATH)
df.tail()

Unnamed: 0,newsId,url,publishDate,title,titleCharCount,textCharCount,topics,body
148094,58bc855d-87e3-4e3d-afd0-27c7fae77ac1,http://g1.globo.com/go/goias/noticia/2023/03/2...,2023-03-26 14:40:11,Jovem morre após levar choque ao pular muro de...,86,1678,[go],"Gabriel Feitosa da Silva, de 25 anos, morreu a..."
148095,60b204c2-54c0-447b-a751-eda8f32a2bf3,http://g1.globo.com/sc/santa-catarina/noticia/...,2022-06-21 00:18:26,Corregedoria-Geral da Justiça apura conduta de...,104,4453,[sc],A Corregedoria-Geral da Justiça está investiga...
148096,d7eab297-d17a-4634-ba6e-7699500e78bb,http://g1.globo.com/pe/pernambuco/blog/viver-n...,2022-12-19 17:37:11,\nEstudo relata 1ª evidência de acasalamento d...,105,4551,[pe],Registros de marca de cópula em tigres em Noro...
148097,132fb4d8-1182-4655-91cf-c809e3700642,http://g1.globo.com/ciencia-e-saude/noticia/a-...,2018-03-23 18:16:02,A trágica história por trás da 'múmia de extra...,76,2971,[mundo],"O esqueleto da múmia Ata tem 13 centímetros, m..."
148098,9ff4179c-d58c-4040-920b-eb4c7dbd031b,http://g1.globo.com/ciencia/noticia/2023/03/14...,2023-03-14 11:06:08,Sonda da Nasa flagra dunas de areia 'atípicas'...,55,2059,[ciencia],As dunas de areia circulares vista em uma imag...


# Preprocessing


## Sampling the dataset
The NPR dataset contains a large number of news articles, so we'll work with a sample of it to demonstrate the concepts.

Let's take only the news articles published in the last 7 days of data:

In [160]:
# Sampling
max_days = 7
df = df[df["publishDate"] >= df["publishDate"].max() + pd.DateOffset(-max_days)]

## Transforming Metadata

Some columns need to be transformed before we can upload them to Qdrant. We'll make the following transformations:
- Convert `publishDate` to `publishTimestamp` so we get a numeric (Unix timestamp, in seconds) representation of the publication time
- Convert `topics` to a *list* instead of a *numpy.ndarray*

In [161]:
# Dtype conversion
df["publishTimestamp"] = df["publishDate"].apply(pd.Timestamp.timestamp)
df.drop(columns=["publishDate"], inplace=True)
df["topics"] = df["topics"].apply(lambda x: x.tolist())

## Generating Sentence

NPR articles have several text fields, such as the article's **title** and **body**. Let's concatenate both strings so we can work with a single sentence per article:

In [162]:
def generate_item_sentence(item: pd.Series) -> str:
    return ' '.join([item["title"], item["body"]])

df["sentence"] = df.apply(generate_item_sentence, axis=1)
df.tail()

Unnamed: 0,newsId,url,title,titleCharCount,textCharCount,topics,body,publishTimestamp,sentence
147774,955f0e20-6820-4ddc-b0fe-9228c3f00bef,http://g1.globo.com/mg/grande-minas/noticia/20...,"Homem morre com cinco tiros na cabeça, em Pai ...",51,997,[mg],"Um homem, de 36 anos, morreu a tiros na tarde ...",1682356000.0,"Homem morre com cinco tiros na cabeça, em Pai ..."
147877,16022899-4880-4505-8b46-db26f016bf00,http://g1.globo.com/politica/blog/andreia-sadi...,CPMI dos atos golpistas: oposição vai mirar em...,80,1520,[política],"General Braga Netto e o ministro da Justiça, F...",1682343000.0,CPMI dos atos golpistas: oposição vai mirar em...
147992,7f9ad39d-09a1-4764-b3f4-59b9c5f2b6f8,http://g1.globo.com/pe/pernambuco/noticia/2023...,Chuva no Grande Recife deixa ruas alagadas e t...,83,3838,[pe],Ruas ficam alagadas após chuvas no Recife\nRua...,1682507000.0,Chuva no Grande Recife deixa ruas alagadas e t...
148007,41350c69-0b8a-4f9c-9eac-e6242e6fc0b0,http://g1.globo.com/jornal-nacional/noticia/20...,Venda de carros usados no país passa a ser cin...,107,2517,[jornal-nacional],Venda de carros usados no país passa a ser cin...,1682814000.0,Venda de carros usados no país passa a ser cin...
148054,6ceee695-f2eb-4541-be38-a7782bd4fdcc,http://g1.globo.com/ba/bahia/noticia/2023/04/2...,Covid-19: veja estratégia de vacinação que inc...,116,14240,[bahia],Vacinação contra Covid-19 segue em Salvador\nB...,1682467000.0,Covid-19: veja estratégia de vacinação que inc...


## Generating Embeddings

Once we have defined the textual feature of our input data, we need an embedding model to generate its numerical representation. Fortunately, hubs like [HuggingFace](https://huggingface.co/) let you look for pre-trained models suited to specific languages or tasks. In our example, we can use the `neuralmind/bert-base-portuguese-cased` model, which was pre-trained on Brazilian Portuguese and is reported to perform well on the following tasks:

- Named Entity Recognition
- Sentence Textual Similarity
- Recognizing Textual Entailment

Let's create an encoder object based on the `SentenceTransformer` class:

In [163]:
encoder = SentenceTransformer(model_name_or_path=MODEL_NAME, device='cpu')

No sentence-transformers model found with name /Users/joao.guedes/.cache/torch/sentence_transformers/neuralmind_bert-base-portuguese-cased. Creating a new one with MEAN pooling.
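
The log line above says that Sentence Transformers wraps the plain BERT checkpoint with MEAN pooling. As a rough illustration of what that means, the sentence vector is the average of the token vectors, skipping padded positions. This is a NumPy sketch under that assumption, not the library's actual code; `mean_pool` and its arguments are hypothetical names.

```python
import numpy as np


def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors of one sentence, ignoring padded positions.

    token_embeddings: array of shape (n_tokens, dim)
    attention_mask:   array of shape (n_tokens,), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, None].astype(token_embeddings.dtype)  # (n_tokens, 1)
    summed = (token_embeddings * mask).sum(axis=0)                 # sum of real tokens
    count = max(float(mask.sum()), 1e-9)                           # avoid division by zero
    return summed / count


# Two real tokens and one padding token that must not affect the result
tokens = np.array([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]])
mask = np.array([1, 1, 0])
sentence_vector = mean_pool(tokens, mask)
```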


We can take an example news article and try to generate an embedding from its textual sentence:

In [164]:
sample_sentence = df.query("newsId == '4159a128-f6c0-44a7-b08f-0a17f099d43f'")["sentence"].values[0]
sample_embedding = encoder.encode(sample_sentence)
print(sample_sentence)
print(sample_embedding[:4])


Paraguaios vão às urnas neste domingo (30) para escolher novo presidente Eleição no Paraguai tem disputa acirrada pela Presidência
Os paraguaios vão às urnas neste domingo (30) para escolher o novo presidente do país.
As pesquisas de opinião indicam uma disputa apertada entre o candidato governista Santiago Peña, do Partido Colorado, e o oposicionista Efraín Alegre, do Concertación Nacional, de esquerda. 
A votação é em turno único, e também vai eleger parlamentares e governadores. Dois temas dominaram as discussões na campanha: a corrupção, e a ampliação das relações comerciais com a China.
[-0.2875876   0.0356041   0.31462672  0.06252239]


Now we need to repeat the embedding generation process for all articles in the sampled dataset. This step can take a while depending on your machine's processing power (GPUs are very welcome here). Therefore, if you have pre-computed embeddings stored in a file, it is a good strategy to load them and only generate embeddings for new articles.

In [165]:
if os.path.isfile(EMBEDDINGS_FILEPATH):
    print("Getting pre-computed embeddings")
    df_embeddings = pd.read_parquet(EMBEDDINGS_FILEPATH, columns=["newsId", "sentence_embedding"])
    # Left join: articles without a stored embedding get NaN
    df = pd.merge(df, df_embeddings, on="newsId", how="left")
else:
    # No cache available: mark every embedding as missing
    df["sentence_embedding"] = np.nan

df.tail()

Getting pre-computed embeddings


Unnamed: 0,newsId,url,title,titleCharCount,textCharCount,topics,body,publishTimestamp,sentence,sentence_embedding
3553,955f0e20-6820-4ddc-b0fe-9228c3f00bef,http://g1.globo.com/mg/grande-minas/noticia/20...,"Homem morre com cinco tiros na cabeça, em Pai ...",51,997,[mg],"Um homem, de 36 anos, morreu a tiros na tarde ...",1682356000.0,"Homem morre com cinco tiros na cabeça, em Pai ...","[-0.20603772, -0.19998942, 0.33290493, 0.07387..."
3554,16022899-4880-4505-8b46-db26f016bf00,http://g1.globo.com/politica/blog/andreia-sadi...,CPMI dos atos golpistas: oposição vai mirar em...,80,1520,[política],"General Braga Netto e o ministro da Justiça, F...",1682343000.0,CPMI dos atos golpistas: oposição vai mirar em...,"[-0.15523164, 0.028003749, 0.43661755, -0.0520..."
3555,7f9ad39d-09a1-4764-b3f4-59b9c5f2b6f8,http://g1.globo.com/pe/pernambuco/noticia/2023...,Chuva no Grande Recife deixa ruas alagadas e t...,83,3838,[pe],Ruas ficam alagadas após chuvas no Recife\nRua...,1682507000.0,Chuva no Grande Recife deixa ruas alagadas e t...,"[-0.2497541, -0.026237434, 0.60738844, -0.0385..."
3556,41350c69-0b8a-4f9c-9eac-e6242e6fc0b0,http://g1.globo.com/jornal-nacional/noticia/20...,Venda de carros usados no país passa a ser cin...,107,2517,[jornal-nacional],Venda de carros usados no país passa a ser cin...,1682814000.0,Venda de carros usados no país passa a ser cin...,"[-0.2967772, -0.18254128, 0.47457638, -0.01638..."
3557,6ceee695-f2eb-4541-be38-a7782bd4fdcc,http://g1.globo.com/ba/bahia/noticia/2023/04/2...,Covid-19: veja estratégia de vacinação que inc...,116,14240,[bahia],Vacinação contra Covid-19 segue em Salvador\nB...,1682467000.0,Covid-19: veja estratégia de vacinação que inc...,"[-0.1728873, -0.020331837, 0.6433897, -0.01650..."


In [166]:
%%time
def get_item_embedding(item: pd.Series) -> np.ndarray:
    """Generates item embedding if it is null"""
    if isinstance(item["sentence_embedding"], np.ndarray):
        return item["sentence_embedding"]
    return encoder.encode(item["sentence"])

df["sentence_embedding"] = df.apply(get_item_embedding, axis=1)

df.tail()

CPU times: user 48.5 ms, sys: 3.9 ms, total: 52.4 ms
Wall time: 51.7 ms


Unnamed: 0,newsId,url,title,titleCharCount,textCharCount,topics,body,publishTimestamp,sentence,sentence_embedding
3553,955f0e20-6820-4ddc-b0fe-9228c3f00bef,http://g1.globo.com/mg/grande-minas/noticia/20...,"Homem morre com cinco tiros na cabeça, em Pai ...",51,997,[mg],"Um homem, de 36 anos, morreu a tiros na tarde ...",1682356000.0,"Homem morre com cinco tiros na cabeça, em Pai ...","[-0.20603772, -0.19998942, 0.33290493, 0.07387..."
3554,16022899-4880-4505-8b46-db26f016bf00,http://g1.globo.com/politica/blog/andreia-sadi...,CPMI dos atos golpistas: oposição vai mirar em...,80,1520,[política],"General Braga Netto e o ministro da Justiça, F...",1682343000.0,CPMI dos atos golpistas: oposição vai mirar em...,"[-0.15523164, 0.028003749, 0.43661755, -0.0520..."
3555,7f9ad39d-09a1-4764-b3f4-59b9c5f2b6f8,http://g1.globo.com/pe/pernambuco/noticia/2023...,Chuva no Grande Recife deixa ruas alagadas e t...,83,3838,[pe],Ruas ficam alagadas após chuvas no Recife\nRua...,1682507000.0,Chuva no Grande Recife deixa ruas alagadas e t...,"[-0.2497541, -0.026237434, 0.60738844, -0.0385..."
3556,41350c69-0b8a-4f9c-9eac-e6242e6fc0b0,http://g1.globo.com/jornal-nacional/noticia/20...,Venda de carros usados no país passa a ser cin...,107,2517,[jornal-nacional],Venda de carros usados no país passa a ser cin...,1682814000.0,Venda de carros usados no país passa a ser cin...,"[-0.2967772, -0.18254128, 0.47457638, -0.01638..."
3557,6ceee695-f2eb-4541-be38-a7782bd4fdcc,http://g1.globo.com/ba/bahia/noticia/2023/04/2...,Covid-19: veja estratégia de vacinação que inc...,116,14240,[bahia],Vacinação contra Covid-19 segue em Salvador\nB...,1682467000.0,Covid-19: veja estratégia de vacinação que inc...,"[-0.1728873, -0.020331837, 0.6433897, -0.01650..."
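
Encoding rows one at a time with `df.apply` works, but `SentenceTransformer.encode` also accepts a list of sentences, so all missing rows can be sent in a single call. The sketch below shows one possible way to do that. `fill_missing_embeddings` and `encode_batch` are hypothetical names, and `encode_batch` is assumed to behave like `encoder.encode` (a list of strings in, one vector per string out).

```python
from typing import Callable, List, Sequence

import numpy as np
import pandas as pd


def fill_missing_embeddings(
    df: pd.DataFrame,
    encode_batch: Callable[[List[str]], Sequence[np.ndarray]],
    text_col: str = "sentence",
    emb_col: str = "sentence_embedding",
) -> pd.DataFrame:
    """Return a copy of df where rows without an embedding get one from encode_batch."""
    out = df.copy()
    current = out[emb_col].tolist()
    # A cell counts as missing when it is not already a NumPy array (e.g. NaN or None)
    missing_idx = [i for i, v in enumerate(current) if not isinstance(v, np.ndarray)]
    if missing_idx:
        texts = [out[text_col].iloc[i] for i in missing_idx]
        # One call for all missing rows instead of one call per row
        for i, vec in zip(missing_idx, encode_batch(texts)):
            current[i] = np.asarray(vec)
    out[emb_col] = current
    return out
```

With the objects defined in this notebook, a call could look like `df = fill_missing_embeddings(df, encoder.encode)`.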


Since generating embeddings for all articles is an expensive process, let's save our results to a local file before we upload them to the vector database:

In [167]:
# Write to the same path the loading cell reads from, so the cache is found on the next run
df[["newsId", "sentence_embedding"]].to_parquet(EMBEDDINGS_FILEPATH, index=False)

# Qdrant Setup

To perform any Qdrant operation, we need to create a client object that points to a vector database. Here, you can either:

1. Create a Qdrant data storage on a local disk, or
2. Connect to a Qdrant server

> Note: For the Qdrant server option, you can [create a free-tier cloud storage using the Qdrant Cloud](https://qdrant.tech/documentation/cloud/) to test your operations or you can [initialize a local server using Docker](https://qdrant.tech/documentation/quick-start/).



In [168]:
# client = QdrantClient(path="./../qdrant_data") # Persists data on a local disk
client = QdrantClient(host="localhost", port=6333) # Connects to a Qdrant server

To check the existing collections in the connected Qdrant database:

In [169]:
client.get_collections()

CollectionsResponse(collections=[CollectionDescription(name='news-articles')])

## Creating a collection

A [collection](https://qdrant.tech/documentation/concepts/collections/) is a named set of points (vectors with a payload) among which you can search. The vector of each point within the same collection must have the same dimensionality and be compared by a single metric, so we need to define the vector configuration when creating a collection:

In [170]:
from qdrant_client.http.models import Distance, VectorParams

collection_name = "news-articles"
existing_collections = [collection.name for collection in client.get_collections().collections]

if collection_name not in existing_collections:
    client.create_collection(
        collection_name=collection_name,
        vectors_config=models.VectorParams(
            size=encoder.get_sentence_embedding_dimension(),
            distance=models.Distance.COSINE,
        ),
    )
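
To make the `Distance.COSINE` choice concrete, the snippet below computes cosine similarity by hand with NumPy. It is only meant to show the formula behind the scores returned later in this notebook; Qdrant computes the metric internally, and `cosine_similarity` is a hypothetical helper name.

```python
import numpy as np


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])   # vector length is ignored
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

Because only the direction of the vectors matters, two articles of very different lengths can still score close to 1 if their embeddings point the same way.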

Checking the collection creation:

In [171]:
client.get_collections()

CollectionsResponse(collections=[CollectionDescription(name='news-articles')])

## Generating Vector Points

Before populating the database, we need to create the objects to be uploaded. In Qdrant, vectors can be stored using the [PointStruct](https://qdrant.tech/documentation/concepts/points/) class, which you can use to define the following properties:
- **id**: the vector's ID (in the NPR case, the `newsId`)
- **vector**: a 1-dimensional array representing the vector (generated by the embedding model)
- **payload**: a dictionary containing any other relevant metadata that can later be used to query vectors in a collection

In [172]:
from qdrant_client.http.models import PointStruct

metadata_columns = df.drop(["newsId", "sentence", "sentence_embedding"], axis=1).columns

def create_vector_point(row: pd.Series) -> PointStruct:
    return PointStruct(
        id=row["newsId"],
        vector=row["sentence_embedding"].tolist(),
        payload={
            field: row[field]
            for field in metadata_columns
            if (str(row[field]) not in ['None', 'nan'])
        }
    )

points = df.apply(create_vector_point, axis=1).tolist()


## Uploading Vectors

Finally, after all items are turned into point structures, we can upload them in chunks to the database:

In [173]:
CHUNK_SIZE = 500
n_chunks = np.ceil(len(points)/CHUNK_SIZE)

for i, points_chunk in enumerate(np.array_split(points, n_chunks)):
    print (f"Processing chunk {i+1}/{int(n_chunks)}")
    operation_info = client.upsert(
        collection_name=collection_name,
        wait=True,
        points=points_chunk.tolist()
    )

Processing chunk 1/8
Processing chunk 2/8
Processing chunk 3/8
Processing chunk 4/8
Processing chunk 5/8
Processing chunk 6/8
Processing chunk 7/8
Processing chunk 8/8
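
The cell above relies on `np.array_split`, which first converts the list of points into a NumPy array and balances the chunk sizes. As an alternative sketch, plain list slicing yields fixed-size chunks (the last one may be smaller) without any conversion. `iter_chunks` is a hypothetical helper name.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_chunks(items: Sequence[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items` with at most `chunk_size` elements each."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(items), chunk_size):
        yield list(items[start:start + chunk_size])


chunks = list(iter_chunks(list(range(5)), 2))
```

Each chunk could then be passed as the `points` argument of `client.upsert`, exactly as in the loop above.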


Checking the number of vectors in a collection:

In [174]:
client.get_collection(collection_name).vectors_count

3558

# Querying Vectors

Now that the collection is populated with vectors, we can start querying the database. Qdrant offers several ways to do that:

## Scroll

The [scroll](https://qdrant.tech/documentation/concepts/points/) operation lets you get all stored points without knowing their IDs, or iterate over the points that match a filter.

> Note 1: for most operations, you can define filtering parameters, which resemble the _WHERE_ clause in a relational database. To learn how to build these filters, check the [filtering documentation](https://qdrant.tech/documentation/concepts/filtering/).

> Note 2: the [payload](https://qdrant.tech/documentation/concepts/filtering/) parameter lets you tell Qdrant which additional information you want to retrieve from your query (think of it as the _SELECT_ columns in a relational database).

For instance, let's query for news articles with topics like `política` (_politics_) and `mundo` (_world_):

In [175]:
from qdrant_client.models import Filter
from qdrant_client.http import models

client.scroll(
    collection_name=collection_name,
    limit=5,
    with_payload=["newsId", "title", "topics"],
    with_vectors=False,
    scroll_filter=Filter(
        must=[
            models.FieldCondition(key="topics", match=models.MatchAny(any=["política", "mundo"])),
        ]
    )
)

([Record(id='004b22c3-87f5-4270-b6bb-4ddfc8490f2a', payload={'title': 'Lira manda recado após veto de indicado para o Tribunal Regional Federal da 1ª Região', 'topics': ['política']}, vector=None),
  Record(id='007fa598-ce88-42a5-b63b-c6810614e663', payload={'title': 'Moraes vota para tornar réus 200 denunciados de incitar e executar atos golpistas de 8 de janeiro', 'topics': ['política']}, vector=None),
  Record(id='00b89b4c-d912-4394-84de-97a94abe8ca9', payload={'title': 'Terremoto de magnitude 5,7 atinge costa de Valparaíso, no Chile', 'topics': ['mundo']}, vector=None),
  Record(id='0114fb71-893f-4a84-839b-5a9b3ddd2b38', payload={'title': 'Projetos do governo passam a travar pauta da Câmara na próxima semana; veja quais', 'topics': ['política']}, vector=None),
  Record(id='0152157d-a7cc-48be-9bda-a5c2b0ae9207', payload={'title': 'Mortos em seita de jejum no Quênia chegam a 73', 'topics': ['mundo']}, vector=None)],
 '01fc725c-c544-460b-a818-04b415497fca')
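
The second element of the tuple returned above is the offset of the next page. A minimal sketch of how one might walk through every matching point is shown below: it keeps calling `scroll` with the returned offset until that offset is `None`. `scroll_all` is a hypothetical helper, and `client` is assumed to expose a `scroll` method that accepts `offset` and returns `(records, next_offset)`, as `QdrantClient.scroll` does.

```python
from typing import Any, Iterator, Optional


def scroll_all(
    client: Any,
    collection_name: str,
    page_size: int = 100,
    **scroll_kwargs: Any,
) -> Iterator[Any]:
    """Yield every record returned by repeated scroll calls."""
    offset: Optional[Any] = None
    while True:
        records, offset = client.scroll(
            collection_name=collection_name,
            limit=page_size,
            offset=offset,      # None on the first call, then the previous page's offset
            **scroll_kwargs,    # e.g. scroll_filter, with_payload, with_vectors
        )
        yield from records
        if offset is None:      # no further pages
            break
```

For example, `list(scroll_all(client, collection_name, with_payload=["title"], with_vectors=False))` would collect the titles of all stored articles.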

## Search API

Searching for the nearest vectors is at the core of many representation learning applications. The key idea is to transform objects into vectors so that objects that are similar in the real world end up close to each other in vector space.

Qdrant offers the [Search API](https://qdrant.tech/documentation/concepts/search/) so we can easily extract vectors close to an input vector. We can use this to create the core functionality of a **search engine** using the same embedding model that was used to generate the vectors in the collection.

In [176]:
# Encode the query text with the same model used for the articles, then search
from qdrant_client.models import Filter
from qdrant_client.http import models

query_text = "Donald Trump"
query_vector = encoder.encode(query_text).tolist()

client.search(
    collection_name=collection_name,
    query_vector=query_vector,
    limit=10,
    with_payload=["newsId", "title", "topics"],
    with_vectors=False,
    score_threshold=0,
    query_filter=None
)

[ScoredPoint(id='4159a128-f6c0-44a7-b08f-0a17f099d43f', version=15, score=0.3904202, payload={'title': 'Paraguaios vão às urnas neste domingo (30) para escolher novo presidente', 'topics': ['jornal-nacional']}, vector=None),
 ScoredPoint(id='e4db00c9-7bcb-4c88-a502-4c6c840dde30', version=10, score=0.3860359, payload={'title': 'Eleitores dizem que Biden e Trump não deveriam concorrer em 2024, mostra pesquisa Reuters/Ipsos', 'topics': ['mundo']}, vector=None),
 ScoredPoint(id='8efe078b-9f42-446e-a970-9abfe408bb84', version=12, score=0.37755477, payload={'title': 'Escritora acusa Trump de ter abusado sexualmente dela nos anos 1990', 'topics': ['jornal-nacional']}, vector=None),
 ScoredPoint(id='d2f89819-e36d-4e23-a13c-47c582fe1363', version=14, score=0.3773188, payload={'title': 'Mike Pence, ex-vice de Donald Trump, presta depoimento na Justiça que pode complicar o ex-presidente', 'topics': ['mundo']}, vector=None),
 ScoredPoint(id='4619ee33-07b6-4121-8724-cdc4905bd962', version=10, score

Since we are using a news article dataset, recency is a key factor when delivering news content. We can use the `query_filter` parameter to obtain only articles that were published after a defined timestamp:

In [177]:
query_text = "Donald Trump"
query_vector = encoder.encode(query_text).tolist()

client.search(
    collection_name=collection_name,
    query_vector=query_vector,
    limit=10,
    with_payload=["newsId", "title", "topics", "publishTimestamp"],
    with_vectors=False,
    score_threshold=0,
    query_filter=Filter(
        must=[
            models.FieldCondition(
                key="publishTimestamp", range=models.Range(gt = 1682841633.0)
            ),
            models.FieldCondition(
                key="topics", match=models.MatchAny(any=["mundo"])
            )
        ],
    )
)

[ScoredPoint(id='f9408609-146f-44b1-9fb3-242385193a9b', version=13, score=0.3456782, payload={'publishTimestamp': 1682897818.0, 'title': 'Santiago Peña, de 44 anos, vence eleições no Paraguai', 'topics': ['mundo']}, vector=None),
 ScoredPoint(id='35831976-0e90-4099-ba6d-abfe5884dacd', version=11, score=0.34471738, payload={'publishTimestamp': 1682879121.0, 'title': "Biden ataca meios de comunicação por 'mentiras conspiratórias e má fé'", 'topics': ['mundo']}, vector=None),
 ScoredPoint(id='e9a0473c-c7bd-412c-b01d-512f1fd0c9f4', version=13, score=0.33558688, payload={'publishTimestamp': 1682887948.0, 'title': 'Paraguai começa a apuração de votos em eleições presidenciais concorridas', 'topics': ['mundo']}, vector=None),
 ScoredPoint(id='5e4d9c3f-6446-4f5a-a8a2-58f6d2cf5ad2', version=11, score=0.31639937, payload={'publishTimestamp': 1682898877.0, 'title': 'Conheça o economista Santiago Peña, de 44 anos, o novo presidente do Paraguai', 'topics': ['mundo']}, vector=None),
 ScoredPoint(id=
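
The filter above uses a hard-coded timestamp. As a small sketch, the cutoff can instead be derived from a reference date, in the same spirit as the 7-day sampling at the start of the notebook. `cutoff_timestamp` is a hypothetical helper, and treating naive datetimes as UTC is an assumption made here for reproducibility.

```python
from datetime import datetime, timedelta, timezone


def cutoff_timestamp(latest: datetime, days: int) -> float:
    """Unix timestamp (seconds) of the moment `days` days before `latest`."""
    if latest.tzinfo is None:
        # Assumption: naive datetimes are interpreted as UTC
        latest = latest.replace(tzinfo=timezone.utc)
    return (latest - timedelta(days=days)).timestamp()


# One day before 2023-05-01 00:00:00 UTC
cutoff = cutoff_timestamp(datetime(2023, 5, 1, tzinfo=timezone.utc), days=1)
```

The resulting value can be passed as `models.Range(gt=cutoff)` in the `publishTimestamp` condition.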

## Recommendation API

We can ask the vector database to "recommend" items that are closer to some desired vector IDs but far from undesired vector IDs using the [Recommendation API](https://qdrant.tech/documentation/concepts/search/). The desired and undesired IDs are called `positive` and `negative` examples, respectively, and they are thought of as seeds to the recommendation:

In [178]:

client.recommend(
    collection_name=collection_name,
    positive=['f438551d-3ce9-4b30-8d3f-6fdea841e63c'],
    negative=None,
    limit = 10,
    with_payload=["newsId", "title", "topics", "publishTimestamp"],
    with_vectors=False,
    score_threshold=0,
    query_filter=None
)

[ScoredPoint(id='b218b2a3-e6a8-4624-ab6f-d826f887a7ae', version=10, score=0.9658023, payload={'publishTimestamp': 1682721673.0, 'title': 'Banco do Brasil vai retirar patrocínio de evento agropecuário após ministro ser desconvidado, diz governo', 'topics': ['política', 'economia']}, vector=None),
 ScoredPoint(id='e2e6a63c-f27b-4953-b14b-8603058729eb', version=14, score=0.94846636, payload={'publishTimestamp': 1682822241.0, 'title': 'Cerimônia de abertura da Agrishow é cancelada pelos organizadores', 'topics': ['sp']}, vector=None),
 ScoredPoint(id='004b22c3-87f5-4270-b6bb-4ddfc8490f2a', version=13, score=0.93220955, payload={'publishTimestamp': 1682435454.0, 'title': 'Lira manda recado após veto de indicado para o Tribunal Regional Federal da 1ª Região', 'topics': ['política']}, vector=None),
 ScoredPoint(id='8e27b2ed-c834-4b53-9435-ff062c499e56', version=10, score=0.9289038, payload={'publishTimestamp': 1682478089.0, 'title': 'Bolsonaro depõe nesta quarta à Polícia Federal no inquérito
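
One simple way to picture how positive and negative examples could be combined is to average each group and push the query vector away from the negatives. The sketch below illustrates that idea with NumPy. It is an assumption for intuition only and may differ from what Qdrant computes internally; `recommendation_query` is a hypothetical name.

```python
from typing import Optional, Sequence

import numpy as np


def recommendation_query(
    positive: Sequence[Sequence[float]],
    negative: Optional[Sequence[Sequence[float]]] = None,
) -> np.ndarray:
    """Build a single query vector from positive and negative example vectors."""
    pos = np.mean(np.asarray(positive, dtype=float), axis=0)
    if negative is None or len(negative) == 0:
        return pos                      # only positives: search around their average
    neg = np.mean(np.asarray(negative, dtype=float), axis=0)
    return pos + (pos - neg)            # move away from the average negative


only_positive = recommendation_query([[1.0, 0.0], [3.0, 0.0]])
with_negative = recommendation_query([[1.0, 0.0], [3.0, 0.0]], [[0.0, 2.0]])
```

The resulting vector would then be used like any other query vector in a nearest-neighbor search.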

In addition, we can specify some seed IDs and ask for items that share a given feature with the seed items. For example, let's say we want to find items that are similar to the following ID:

In [179]:
seed_ids = ["9ed25e49-14f9-44cd-b76e-a104b4dfabca"]

seed_records = client.scroll(
    collection_name=collection_name,
    limit=5,
    with_payload=["newsId", "title", "topics"],
    with_vectors=False,
    scroll_filter=Filter(
        must=[
            models.HasIdCondition(has_id=seed_ids),
        ]
    )
)[0]
seed_records

[Record(id='9ed25e49-14f9-44cd-b76e-a104b4dfabca', payload={'title': 'Saiba por que Joe Biden lançou sua candidatura à reeleição nesta terça-feira', 'topics': ['mundo']}, vector=None)]

We can ask Qdrant to recommend items that are similar **and** match the `topics` of the seed item:

In [180]:
match_key = "topics"

client.recommend(
    collection_name=collection_name,
    positive=seed_ids,
    limit = 10,
    with_payload=["newsId", "title", "topics"],
    with_vectors=False,
    score_threshold=0,
    query_filter=Filter(
        should=[
            models.FieldCondition(key=match_key, match=models.MatchAny(any=record.payload[match_key]))
            for record in seed_records
        ]
    )
)

[ScoredPoint(id='38a1cd6f-aa07-4264-8ca1-4cddf05e7afe', version=11, score=0.9734961, payload={'title': 'Biden anuncia que vai concorrer à reeleição', 'topics': ['mundo']}, vector=None),
 ScoredPoint(id='4619ee33-07b6-4121-8724-cdc4905bd962', version=10, score=0.9709304, payload={'title': 'EUA: os 4 motivos que levaram Biden a se lançar à reeleição', 'topics': ['mundo']}, vector=None),
 ScoredPoint(id='e4db00c9-7bcb-4c88-a502-4c6c840dde30', version=10, score=0.96000206, payload={'title': 'Eleitores dizem que Biden e Trump não deveriam concorrer em 2024, mostra pesquisa Reuters/Ipsos', 'topics': ['mundo']}, vector=None),
 ScoredPoint(id='7f45e444-b220-47e4-933a-ac6dc3074f56', version=9, score=0.95809764, payload={'title': 'A gafe de assessora de Biden que gerou dúvidas sobre eventual segundo governo após eleição', 'topics': ['mundo']}, vector=None),
 ScoredPoint(id='d2f89819-e36d-4e23-a13c-47c582fe1363', version=14, score=0.95291734, payload={'title': 'Mike Pence, ex-vice de Donald Trump