## Semantic search

Install the Sentence Transformers library.

In [None]:
!pip install sentence_transformers

Import the libraries that will be used.

In [2]:
import sentence_transformers
import numpy as np

Define the corpus to work with.

In [3]:
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'A cheetah is running behind its prey.']

Choose the pretrained model and load the `embedder` that will compute the vectors.

In [4]:
model = 'paraphrase-multilingual-MiniLM-L12-v2'

embedder = sentence_transformers.SentenceTransformer(model)

Pass the corpus to the embedder to represent each text as a vector.

In [20]:
corpus_embeddings = embedder.encode(corpus)

In [None]:
corpus_embeddings[0]

Now that the texts are represented as vectors, we can write a function that performs a semantic search by computing the similarity between a query and each sentence in the corpus.

In [8]:
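def query(embedder, corpus, query, top_k=5):
    # Note: relies on the global `corpus_embeddings` computed above,
    # which must match `corpus` entry for entry.
    query_embedding = embedder.encode([query])
    # Cosine similarity between the query and every corpus sentence
    cos_scores = sentence_transformers.util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
    cos_scores = cos_scores.cpu().numpy()
    # Never ask for more results than there are sentences
    top_k = min(top_k, len(corpus))
    # Indices of the top_k highest scores, from most to least similar
    top_results = np.argsort(-cos_scores)[:top_k]
    print('\nQuery:', query)
    for idx in top_results:
        print(corpus[idx].strip(), "(Score: %.4f)" % (cos_scores[idx]))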
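
To make the scoring step concrete, here is a minimal NumPy-only sketch of what the function above computes: the cosine similarity between one query vector and each corpus vector, followed by a top-k selection. The 3-dimensional vectors and the names `cosine_scores`, `toy_corpus_vecs` and `toy_query_vec` are made up for illustration; real embeddings from the model have many more dimensions.

```python
import numpy as np

def cosine_scores(query_vec, corpus_vecs):
    """Cosine similarity between one query vector and each row of corpus_vecs."""
    query_norm = query_vec / np.linalg.norm(query_vec)
    corpus_norm = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    return corpus_norm @ query_norm

# Hypothetical toy vectors standing in for sentence embeddings
toy_corpus_vecs = np.array([[1.0, 0.0, 0.0],
                            [0.0, 1.0, 0.0],
                            [1.0, 1.0, 0.0],
                            [0.0, 0.0, 1.0]])
toy_query_vec = np.array([2.0, 0.0, 0.0])

scores = cosine_scores(toy_query_vec, toy_corpus_vecs)
top_k = 2
# argsort on the negated scores gives indices from most to least similar
top_indices = np.argsort(-scores)[:top_k]
for idx in top_indices:
    print(idx, round(float(scores[idx]), 4))
```

Because cosine similarity ignores vector length, the query `[2, 0, 0]` matches the first toy vector `[1, 0, 0]` perfectly even though their magnitudes differ.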

Now we query the system.

In [9]:
query(embedder, corpus, 'A man is eating pasta.')


Query: A man is eating pasta.
A man is eating food. (Score: 0.6734)
A man is eating a piece of bread. (Score: 0.4269)
A man is riding a horse. (Score: 0.2086)
A man is riding a white horse on an enclosed ground. (Score: 0.1020)
A cheetah is running behind its prey. (Score: 0.0566)


In [10]:
query(embedder, corpus, 'Someone in a gorilla costume is playing a set of drums.')

query(embedder, corpus, 'A cheetah chases prey on across a field.')


Query: Someone in a gorilla costume is playing a set of drums.
A monkey is playing drums. (Score: 0.8167)
A cheetah is running behind its prey. (Score: 0.2720)
A woman is playing violin. (Score: 0.1721)
A man is riding a horse. (Score: 0.1291)
A man is riding a white horse on an enclosed ground. (Score: 0.1213)

Query: A cheetah chases prey on across a field.
A cheetah is running behind its prey. (Score: 0.9147)
A monkey is playing drums. (Score: 0.2655)
A man is riding a horse. (Score: 0.1933)
A man is riding a white horse on an enclosed ground. (Score: 0.1733)
A man is eating food. (Score: 0.0329)


Since we are using a multilingual pretrained model, we can also run queries in Spanish.

In [11]:
query(embedder, corpus, 'Una persona comiendo pasta.')

query(embedder, corpus, 'El músico toca la guitarra.')

query(embedder, corpus, 'El buitre atrapa a su presa.')


Query: Una persona comiendo pasta.
A man is eating food. (Score: 0.7202)
A man is eating a piece of bread. (Score: 0.6248)
A man is riding a horse. (Score: 0.1179)
A monkey is playing drums. (Score: 0.1001)
A man is riding a white horse on an enclosed ground. (Score: 0.0589)

Query: El músico toca la guitarra.
A woman is playing violin. (Score: 0.2975)
A monkey is playing drums. (Score: 0.2395)
A man is eating a piece of bread. (Score: 0.0664)
A man is riding a white horse on an enclosed ground. (Score: 0.0584)
A cheetah is running behind its prey. (Score: 0.0552)

Query: El buitre atrapa a su presa.
A cheetah is running behind its prey. (Score: 0.5262)
A monkey is playing drums. (Score: 0.2309)
A man is riding a white horse on an enclosed ground. (Score: 0.1388)
A man is eating a piece of bread. (Score: 0.1280)
A man is eating food. (Score: 0.1161)


## Duplicate search

The following code finds the pairs of texts with the highest semantic similarity.

In [38]:
paraphrases = sentence_transformers.util.paraphrase_mining(embedder, corpus)

for paraphrase in paraphrases[0:10]:
    score, i, j = paraphrase
    print("{:50}\t<- is like ->\t{:55}\twith score:\t{:.4f}".format(corpus[i], corpus[j], score))

A monkey is playing drums.                        	<- is like ->	Un mono haciendo ruido con unos tambores.              	with score:	0.9317
Un mono haciendo ruido con unos tambores.         	<- is like ->	太鼓で音を出す猿                                               	with score:	0.8842
A monkey is playing drums.                        	<- is like ->	太鼓で音を出す猿                                               	with score:	0.8452
A man is riding a horse.                          	<- is like ->	A man is riding a white horse on an enclosed ground.   	with score:	0.7884
A man is eating food.                             	<- is like ->	A man is eating a piece of bread.                      	with score:	0.7261
A cheetah is running behind its prey.             	<- is like ->	太鼓で音を出す猿                                               	with score:	0.3069
A cheetah is running behind its prey.             	<- is like ->	Un mono haciendo ruido con unos tambores.              	with score:	0.2914
A man 
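
As the loop above shows, `paraphrase_mining` returns entries of the form `score, i, j`, ordered from the most to the least similar pair. The idea can be sketched with a brute-force NumPy version that scores every pair once. This is only an illustration on made-up 2-dimensional vectors, not the library's implementation; `mine_pairs` and `toy_vecs` are hypothetical names.

```python
import numpy as np

def mine_pairs(vectors, top_k=3):
    """Return the top_k most similar (score, i, j) pairs with i < j."""
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = normed @ normed.T
    # Upper triangle (k=1) skips self-pairs and mirrored duplicates
    rows, cols = np.triu_indices(len(vectors), k=1)
    pairs = [(float(sims[i, j]), int(i), int(j)) for i, j in zip(rows, cols)]
    pairs.sort(key=lambda p: p[0], reverse=True)
    return pairs[:top_k]

# Hypothetical toy vectors standing in for sentence embeddings
toy_vecs = np.array([[1.0, 0.0],
                     [0.9, 0.1],
                     [0.0, 1.0]])
result = mine_pairs(toy_vecs)
for score, i, j in result:
    print(i, j, round(score, 4))
```

Comparing every pair grows quadratically with the corpus size, so this brute-force form is only reasonable for small corpora like the one in this notebook.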

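Add some sentences in Spanish and Japanese to the corpus. (Judging by the execution counts, this cell was run before the paraphrase-mining cell above, which is why that output already includes these sentences.)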

In [19]:
corpus.append('Un mono haciendo ruido con unos tambores.')
corpus.append('太鼓で音を出す猿')
corpus

['A man is eating food.',
 'A man is eating a piece of bread.',
 'The girl is carrying a baby.',
 'A man is riding a horse.',
 'A woman is playing violin.',
 'Two men pushed carts through the woods.',
 'A man is riding a white horse on an enclosed ground.',
 'A monkey is playing drums.',
 'A cheetah is running behind its prey.',
 'Un mono haciendo ruido con unos tambores.',
 '太鼓で音を出す猿']
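
Because `query` reads the `corpus_embeddings` variable, the embeddings have to be recomputed (or extended) whenever the corpus changes, as it does after the two `append` calls above. One possible way to keep both in step is sketched below. The `SemanticIndex` class and the `fake_encode` stand-in are hypothetical; in this notebook, `embedder.encode` would take the place of `fake_encode`.

```python
import numpy as np

class SemanticIndex:
    """Keeps a list of texts and their embedding matrix in sync."""

    def __init__(self, encode_fn):
        self.encode_fn = encode_fn   # callable: list of str -> 2-D array-like
        self.texts = []
        self.embeddings = None

    def add(self, new_texts):
        # Encode only the new texts and append their vectors as new rows
        new_vecs = np.asarray(self.encode_fn(new_texts), dtype=float)
        self.texts.extend(new_texts)
        if self.embeddings is None:
            self.embeddings = new_vecs
        else:
            self.embeddings = np.vstack([self.embeddings, new_vecs])

def fake_encode(texts):
    # Hypothetical stand-in for embedder.encode: [length, number of spaces]
    return [[len(t), t.count(' ')] for t in texts]

index = SemanticIndex(fake_encode)
index.add(['A monkey is playing drums.'])
index.add(['Un mono haciendo ruido con unos tambores.',
           'A cheetah is running behind its prey.'])
print(len(index.texts), index.embeddings.shape)
```

With this arrangement, row `i` of `index.embeddings` always corresponds to `index.texts[i]`, so a search over the embeddings cannot silently miss newly added sentences.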