<a href="https://colab.research.google.com/github/franciscosanchezoliver/machine_learning_training/blob/main/tema5_recuperacion_informacion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Semantic similarity and semantic search

In this exercise **we will use "Sentence Transformers"**, a Python package that **represents texts as embedding vectors**. We will use these vectors for semantic similarity tasks.
<br><br>
The **goal** of this exercise is to **practice semantic similarity and semantic search based on embedding vectors**.

Installing the Python package

In [2]:
!pip install sentence_transformers

Collecting sentence_transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting transformers<5.0.0,>=4.6.0 (from sentence_transformers)
  Downloading transformers-4.35.0-py3-none-any.whl (7.9 MB)
Collecting sentencepiece (from sentence_transformers)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub>=0.4.0 (from sentence_transformers)
  Downloading huggingface_hub-0.18.0-py3-none-any.whl (301 kB)

In [3]:
import sentence_transformers
import numpy as np

In [4]:
# We need a corpus of example texts,
# so we define a list of texts by hand.
corpus = [
    'A man is eating food.',
    'A man is eating a piece of bread.',
    'The girl is carrying a baby.',
    'A man is riding a horse.',
    'A woman is playing violin.',
    'Two men pushed carts through the woods.',
    'A man is riding a white horse on a enclosed ground.',
    'A monkey is playing drums.',
    'A cheetah is running behind its prey.'
]

In [5]:
# Load a model that is already trained to compute
# embedding vectors.
# The sentence_transformers package offers a large number of models.
# Here we use 'paraphrase-multilingual-MiniLM-L12-v2',
# since it offers a good balance between processing speed and quality
# for the semantic similarity task.
model = 'paraphrase-multilingual-MiniLM-L12-v2'
embedder = sentence_transformers.SentenceTransformer(model)


In [6]:
# Once the model is instantiated, we can convert
# each entry of our corpus into its embedding vector
# representation.
corpus_embeddings = embedder.encode(corpus)

In [7]:
# Each entry of the corpus has now been turned into an
# embedding vector.
# We can display them (showing only the first 5 components of
# each embedding vector).
print("CORPUS")
print("------")
for original_string, string_in_embedding in zip(corpus, corpus_embeddings):
  print(f"{original_string} => {string_in_embedding[:5]}")

CORPUS
------
A man is eating food. => [ 0.08243293 -0.08494214 -0.04580687 -0.15420295 -0.06507353]
A man is eating a piece of bread. => [-0.26802075  0.17192025  0.0407873  -0.32516387 -0.5825334 ]
The girl is carrying a baby. => [-0.23283628  0.14709719 -0.13297518  0.22071306  0.20328966]
A man is riding a horse. => [-0.13889365  0.24662638 -0.3536097   0.15411115 -0.14513905]
A woman is playing violin. => [ 0.2829272  -0.4090246  -0.11486573  0.18298846 -0.52740955]
Two men pushed carts through the woods. => [0.13682659 0.34416133 0.11296383 0.06362064 0.2132838 ]
A man is riding a white horse on a enclosed ground. => [ 0.27220398  0.3953403  -0.21291898  0.223109    0.07229286]
A monkey is playing drums. => [ 0.1886413  -0.31130424  0.0550554  -0.21825454 -0.43666178]
A cheetah is running behind its prey. => [-0.04663849  0.46679556 -0.05566684  0.12200647 -0.07822816]


In [8]:
# Now we run a semantic search. To do so:
# 1. Compute the embedding vector of the query (using the same model
#    that was used for the corpus).
#
# 2. Since both the corpus and the query are now vectors, we can compute
#    the cosine similarity between the query and each corpus entry.
#    The entries most similar to the query are the ones we should return.
#
# 3. As the result, return the corpus entries sorted from highest to
#    lowest similarity to the query.
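
The three steps above can be sketched without loading any model. This is only a minimal illustration, assuming made-up 3-dimensional vectors in place of real embeddings; `cosine_similarity` and `rank_corpus` are hypothetical helper names and are not part of Sentence Transformers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_corpus(query_vector, corpus_vectors):
    """Return (score, index) pairs sorted from most to least similar."""
    scored = [
        (cosine_similarity(query_vector, vector), index)
        for index, vector in enumerate(corpus_vectors)
    ]
    return sorted(scored, reverse=True)


# Toy vectors (assumed values, not produced by any model).
toy_corpus = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.1, 0.0]

ranking = rank_corpus(toy_query, toy_corpus)
for score, index in ranking:
    print(f"{score:.3f} -> corpus entry {index}")
```

The entry pointing in almost the same direction as the query is ranked first, and the orthogonal one last.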

In [9]:
# This is the query we are going to run.
query = 'A man is eating pasta'

In [10]:
query_in_embeddings = embedder.encode(query)

In [11]:
print("QUERY DONE")
print("----------")

print(f"{query} => {query_in_embeddings[:5]}")

QUERY DONE
----------
A man is eating pasta => [-0.17130324 -0.53461146 -0.18213265  0.09919011 -0.24236323]


In [25]:
# Now we can compute the cosine similarity between the query and
# each entry of our corpus.

In [33]:
# pytorch_cos_sim returns a 1x1 tensor here; [0][0] extracts the score.
similarity = sentence_transformers.util.pytorch_cos_sim(query_in_embeddings, corpus_embeddings[0])
print(similarity[0][0])

tensor(0.6921)
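
For reference, the value computed here is a cosine similarity, not a distance. With $q$ the query embedding and $d$ a corpus embedding, the standard definition is:

```latex
\operatorname{cos\_sim}(q, d)
  = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}
  = \frac{\sum_i q_i d_i}{\sqrt{\sum_i q_i^2} \, \sqrt{\sum_i d_i^2}}
```

The score lies in $[-1, 1]$. Values near 1 mean the two vectors point in almost the same direction, values near 0 mean they are roughly orthogonal, and negative values mean they point in opposing directions. A higher score therefore indicates a closer match.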


In [34]:
# Store the similarity scores in a list so we can sort them later
all_similarities = []

for sentence, sentence_embedding in zip(corpus, corpus_embeddings):
  # The library provides a helper to compute the cosine similarity
  similarity = sentence_transformers.util.pytorch_cos_sim(query_in_embeddings, sentence_embedding)
  # Convert the 1x1 tensor to a plain Python float
  all_similarities.append((similarity[0][0].item(), sentence))


# Sort in descending order, so the most similar sentences come first
all_similarities.sort(reverse=True)

In [37]:
print(f"Similarity with [{query}]")
for each_similarity, sentence in all_similarities:
  print(f"  - [{each_similarity} | {sentence}]")

Similarity with [A man is eating pasta]
  - [0.6921100616455078 | A man is eating food.]
  - [0.44878262281417847 | A man is eating a piece of bread.]
  - [0.20080715417861938 | A man is riding a horse.]
  - [0.09405660629272461 | A man is riding a white horse on a enclosed ground.]
  - [0.04883880168199539 | A cheetah is running behind its prey.]
  - [0.0285110455006361 | A monkey is playing drums.]
  - [0.027330679818987846 | A woman is playing violin.]
  - [-0.10713642835617065 | Two men pushed carts through the woods.]
  - [-0.11389458924531937 | The girl is carrying a baby.]
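
In practice a search usually returns only the best few matches rather than the whole ranked corpus. A minimal sketch, assuming the scores printed above (rounded to four decimals) are copied into a plain list; `top_k` is a hypothetical helper and `k = 3` is an arbitrary choice:

```python
import heapq

# Scores copied from the ranked output above, rounded to four decimals,
# listed in the original corpus order.
scored_sentences = [
    (0.6921, 'A man is eating food.'),
    (0.4488, 'A man is eating a piece of bread.'),
    (-0.1139, 'The girl is carrying a baby.'),
    (0.2008, 'A man is riding a horse.'),
    (0.0273, 'A woman is playing violin.'),
    (-0.1071, 'Two men pushed carts through the woods.'),
    (0.0941, 'A man is riding a white horse on a enclosed ground.'),
    (0.0285, 'A monkey is playing drums.'),
    (0.0488, 'A cheetah is running behind its prey.'),
]


def top_k(scored, k):
    """Return the k highest-scoring (score, sentence) pairs, best first."""
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


best_matches = top_k(scored_sentences, 3)
for score, sentence in best_matches:
    print(f"{score:.4f} | {sentence}")
```

Using `heapq.nlargest` avoids sorting the full list when only a handful of results are needed, which matters more as the corpus grows.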


We could also use this to detect duplicates: set a similarity threshold, for example 0.50, and treat any pair of texts whose similarity exceeds it as duplicates.
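
A minimal sketch of that idea, assuming toy 2-dimensional vectors instead of real embeddings and the 0.50 threshold used as the example; `cosine_similarity` and `find_duplicates` are hypothetical helper names. A suitable threshold would likely need tuning for the model and the data at hand.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def find_duplicates(vectors, threshold=0.50):
    """Return index pairs (i, j), i < j, whose similarity exceeds threshold."""
    pairs = []
    for (i, vec_i), (j, vec_j) in combinations(enumerate(vectors), 2):
        if cosine_similarity(vec_i, vec_j) > threshold:
            pairs.append((i, j))
    return pairs


# Toy vectors (assumed values): the first two point almost the same way.
toy_vectors = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
]

duplicate_pairs = find_duplicates(toy_vectors)
print(duplicate_pairs)
```

This compares every pair of vectors, so the cost grows quadratically with the number of texts; that is fine for a small corpus like the one in this exercise.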