# Integrating Embeddings with Queries in an Information Retrieval System

## Objective

In this exercise, we will learn how to integrate embeddings with a query to enhance an Information Retrieval (IR) system. We will use both static and contextual embeddings to generate representations of queries and documents, compute their similarities, and rank the documents based on relevance to the query.

---

## Stages Covered

1. **Introduction to Pre-trained Transformer Models**
   - Load and use BERT for contextual embeddings.
   - Load and use Word2Vec for static embeddings.

2. **Generating Text Embeddings**
   - Generate embeddings for queries and documents using BERT.
   - Generate embeddings for queries and documents using Word2Vec.

3. **Computing Similarity Between Embeddings**
   - Compute cosine similarity between query and document embeddings.
   - Rank documents based on similarity scores.

4. **Integrating Embeddings with Queries**
   - Practical implementation of embedding-based retrieval for a given text corpus.

---

## Prerequisites

- TensorFlow
- Hugging Face's Transformers library
- Gensim library
- Scikit-learn library
- A text corpus in the `data` folder

---

## Exercise

Follow the steps below to integrate embeddings with a query and enhance your IR system.



Step 0: Verify requirements:

* tensorflow
* transformers
* gensim
* scikit-learn
* pandas
* numpy
* matplotlib
* seaborn

Step 1: Download dataset from Kaggle

URL: https://www.kaggle.com/datasets/zynicide/wine-reviews

In [17]:
#import kaggle
import pandas as pd
#kaggle.api.dataset_download_cli(dataset='zynicide/wine-reviews')

wine_df = pd.read_csv('data/winemag-data_first150k.csv')
#print(wine_df.head())
corpus = wine_df[:50]
print(corpus.head())

   Unnamed: 0 country                                        description  \
0           0      US  This tremendous 100% varietal wine hails from ...   
1           1   Spain  Ripe aromas of fig, blackberry and cassis are ...   
2           2      US  Mac Watson honors the memory of a wine once ma...   
3           3      US  This spent 20 months in 30% new French oak, an...   
4           4  France  This is the top wine from La Bégude, named aft...   

                            designation  points  price        province  \
0                     Martha's Vineyard      96  235.0      California   
1  Carodorum Selección Especial Reserva      96  110.0  Northern Spain   
2         Special Selected Late Harvest      96   90.0      California   
3                               Reserve      96   65.0          Oregon   
4                            La Brûlade      95   66.0        Provence   

            region_1           region_2             variety  \
0        Napa Valley               

Step 2: Load a Pre-trained Transformer Model

Use the BERT model for generating contextual embeddings and Word2Vec for static embeddings.

In [2]:
from gensim.models import KeyedVectors

# Path to the downloaded file
model_path = 'data/GoogleNews-vectors-negative300.bin.gz'
# Load the pre-trained Word2Vec model
word2vec_model = KeyedVectors.load_word2vec_format(model_path, binary=True)

In [3]:
from transformers import BertTokenizer, TFBertModel

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = TFBertModel.from_pretrained('bert-base-uncased')

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on,

Step 3: Generate Text Embeddings

Static Embeddings with Word2Vec

In [4]:
import numpy as np

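def generate_word2vec_embeddings(texts):
    """Return one mean-pooled Word2Vec vector per text, shape (n_texts, vector_size)."""
    embeddings = []
    for text in texts:
        tokens = text.lower().split()
        # Keep only the tokens that are in the Word2Vec vocabulary
        word_vectors = [word2vec_model[word] for word in tokens if word in word2vec_model]
        if word_vectors:
            embeddings.append(np.mean(word_vectors, axis=0))
        else:
            # No known tokens: fall back to a zero vector
            embeddings.append(np.zeros(word2vec_model.vector_size))
    return np.array(embeddings)

# Pass the review texts, not the DataFrame: iterating over a DataFrame yields its column names
word2vec_embeddings = generate_word2vec_embeddings(wine_df['description'][:10])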
#print("Word2Vec Embeddings:", word2vec_embeddings)
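The averaging step can be checked by hand on a tiny example. The sketch below is only an illustration: `toy_vectors` is a hypothetical 3-dimensional vocabulary standing in for `word2vec_model`, and `mean_pool` is a made-up helper name, not part of Gensim.

```python
import numpy as np

# Hypothetical 3-dimensional vocabulary standing in for word2vec_model
toy_vectors = {
    "red":   np.array([1.0, 0.0, 0.0]),
    "wine":  np.array([0.0, 1.0, 0.0]),
    "fruit": np.array([0.0, 0.0, 1.0]),
}

def mean_pool(text, vectors, dim=3):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [vectors[tok] for tok in text.lower().split() if tok in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

# "with" and "unknownword" are out of vocabulary, so only "red" and "wine" are averaged
doc_vec = mean_pool("Red wine with unknownword", toy_vectors)
# No token is known, so the fallback zero vector is returned
empty_vec = mean_pool("zzz qqq", toy_vectors)

print(doc_vec)
print(empty_vec)
```

Because out-of-vocabulary tokens are silently dropped, two texts that share only their known words end up with identical vectors, which is one limitation of this kind of static pooling.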

Contextual Embeddings with BERT

In [5]:
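def generate_bert_embeddings(texts):
    """Return the [CLS] embedding of each text, shape (n_texts, 768, 1)."""
    embeddings = []
    for text in texts:
        inputs = tokenizer(text, return_tensors='tf', padding=True, truncation=True)
        outputs = bert_model(**inputs)
        # last_hidden_state is (1, seq_len, 768); position 0 is the [CLS] token
        embeddings.append(outputs.last_hidden_state[:, 0, :].numpy())
    # Stacked shape is (n_texts, 1, 768); the transpose gives (n_texts, 768, 1),
    # so reshape to 2-D before calling cosine_similarity
    return np.array(embeddings).transpose(0, 2, 1)

# Pass the review texts, not the DataFrame: iterating over a DataFrame yields its column names
bert_embeddings = generate_bert_embeddings(wine_df['description'][:10])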
#print("BERT Embeddings:", bert_embeddings)
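Taking the `[CLS]` vector is one pooling choice. A common alternative is to average the token vectors while ignoring padding. The sketch below is not what this exercise uses. It runs on toy NumPy arrays that stand in for `outputs.last_hidden_state` and the tokenizer's `attention_mask`, and `masked_mean_pool` is a hypothetical helper name.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors over real (non-padding) positions.

    hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1.
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Toy stand-in for last_hidden_state: 1 text, 3 token positions, 2 dimensions
hidden = np.array([[[1.0, 1.0], [3.0, 5.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])  # the third position is padding

cls_vec = hidden[:, 0, :]                  # [CLS]-style pooling: first position only
mean_vec = masked_mean_pool(hidden, mask)  # average of the two real positions

print(cls_vec)
print(mean_vec)
```

Both approaches return one vector per text, so either can be passed to the similarity step that follows; which one ranks documents better is an empirical question for the corpus at hand.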

Step 4: Compute Similarity Between Embeddings

Use the scikit-learn library.
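For reference, the cosine similarity of two vectors u and v is their dot product divided by the product of their norms. The NumPy sketch below shows one way to compute it for every pair of rows. It is an illustration of the formula, not scikit-learn's implementation, and `cosine_similarity_matrix` is a hypothetical name.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a (n, d) and the rows of b (m, d)."""
    # Normalise each row to unit length; the clip guards against all-zero rows
    a_norm = a / np.clip(np.linalg.norm(a, axis=1, keepdims=True), 1e-12, None)
    b_norm = b / np.clip(np.linalg.norm(b, axis=1, keepdims=True), 1e-12, None)
    # The dot product of unit vectors is the cosine of the angle between them
    return a_norm @ b_norm.T

vectors = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
sim = cosine_similarity_matrix(vectors, vectors)
print(sim)
```

The first two rows point in the same direction and score 1.0 despite their different lengths, while the third is orthogonal to both and scores 0.0. Both inputs must be 2-D, with one row per text.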

In [8]:
'''
from sklearn.metrics.pairwise import cosine_similarity

# Cosine similarity between Word2Vec embeddings
word2vec_similarity = cosine_similarity(word2vec_embeddings)
print("Word2Vec Cosine Similarity:\n", word2vec_similarity)

# Cosine similarity between BERT embeddings
bert_similarity = cosine_similarity(bert_embeddings.reshape(10,768))
print("BERT Cosine Similarity:\n", bert_similarity)
'''


Step 5: Compare Contextual and Static Embeddings

Analyze and compare the similarity results from both BERT and Word2Vec embeddings.
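Besides inspecting heatmaps, the two similarity matrices can be compared numerically. The sketch below is one possible approach, shown on toy 3x3 matrices that stand in for `word2vec_similarity` and `bert_similarity`: it takes each document pair once (the upper triangle) and measures how strongly the two models agree on pair scores.

```python
import numpy as np

def upper_triangle(matrix):
    """Return the entries above the diagonal, i.e. each document pair exactly once."""
    rows, cols = np.triu_indices_from(matrix, k=1)
    return matrix[rows, cols]

# Toy similarity matrices (assumed values, for illustration only)
sim_a = np.array([[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]])
sim_b = np.array([[1.0, 0.8, 0.3], [0.8, 1.0, 0.4], [0.3, 0.4, 1.0]])

pairs_a, pairs_b = upper_triangle(sim_a), upper_triangle(sim_b)

# Pearson correlation of the pair scores: close to 1 means the models agree
agreement = np.corrcoef(pairs_a, pairs_b)[0, 1]
# Standard deviation of the pair scores: how well each model separates pairs
spread_a, spread_b = pairs_a.std(), pairs_b.std()

print(f"agreement={agreement:.3f} spread_a={spread_a:.3f} spread_b={spread_b:.3f}")
```

A small spread suggests that a model scores most pairs similarly, which makes ranking less discriminative; whether that applies to either model here has to be checked on the real matrices.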

In [9]:
'''
import matplotlib.pyplot as plt
import seaborn as sns

def plot_similarity_matrix(matrix, title, figsize=(8, 6), annotation=True):
    plt.figure(figsize=figsize)
    sns.heatmap(matrix, annot=annotation, cmap='coolwarm', fmt='.2f')
    plt.title(title)
    plt.show()

plot_similarity_matrix(word2vec_similarity, "Word2Vec Cosine Similarity")
plot_similarity_matrix(bert_similarity, "BERT Cosine Similarity")
'''


Step 6: Applying to Corpus

In [10]:
'''
# Generate embeddings for the corpus (pass the review texts, not the DataFrame)
corpus_word2vec_embeddings = generate_word2vec_embeddings(corpus['description'][:500])
corpus_bert_embeddings = generate_bert_embeddings(corpus['description'][:500])

# Compute similarity for the corpus
corpus_word2vec_similarity = cosine_similarity(corpus_word2vec_embeddings)
corpus_bert_similarity = cosine_similarity(corpus_bert_embeddings.reshape(corpus_bert_embeddings.shape[:2]))

# Plot similarity matrices
plot_similarity_matrix(corpus_word2vec_similarity, "Corpus Word2Vec Cosine Similarity", figsize=(16, 12), annotation=False)
plot_similarity_matrix(corpus_bert_similarity, "Corpus BERT Cosine Similarity", figsize=(16, 12), annotation=False)
'''

### Parallel Programming


In [6]:
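import multiprocessing

# Parallel version for Word2Vec: each worker embeds one contiguous chunk of texts
def generate_word2vec_embeddings_parallel(texts):
    texts = list(texts)
    num_cores = multiprocessing.cpu_count()
    chunk_size = max(1, -(-len(texts) // num_cores))  # ceiling division
    # pool.map calls the function once per element, so pass lists of texts, not single texts
    chunks = [texts[i:i + chunk_size] for i in range(0, len(texts), chunk_size)]
    # The workers must be able to see word2vec_model (true with the "fork" start method)
    with multiprocessing.Pool(processes=num_cores) as pool:
        embeddings = pool.map(generate_word2vec_embeddings, chunks)
    return np.concatenate(embeddings, axis=0)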

In [11]:
corpus_word2Vec = generate_word2vec_embeddings_parallel(wine_df['description'][:1])
print("Word Embeddings:", corpus_word2Vec)
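`Pool.map` calls the worker function once per element of the iterable it receives. A worker that loops over a list of texts should therefore be handed chunks of texts rather than individual strings. The helper below is a minimal sketch of such a split; `split_into_chunks` is a hypothetical name and the sample texts are placeholders.

```python
def split_into_chunks(items, num_chunks):
    """Split items into at most num_chunks contiguous chunks, preserving order."""
    items = list(items)
    if not items:
        return []
    chunk_size = max(1, -(-len(items) // num_chunks))  # ceiling division
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

chunks = split_into_chunks(["t1", "t2", "t3", "t4", "t5"], num_chunks=2)

# Each chunk is itself a list of texts, which is what a function that
# iterates over `texts` expects to receive
flattened = [text for chunk in chunks for text in chunk]

print(chunks)
print(flattened)
```

Because the chunks are contiguous, concatenating the per-chunk results in order gives embeddings in the same order as the input texts.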

In [7]:
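import multiprocessing

# Parallel version for BERT: each worker embeds one contiguous chunk of texts.
# Caution: forking after TensorFlow has been initialised is not reliably safe and
# may hang; embedding the texts in batches within one process is an alternative.
def generate_bert_embeddings_parallel(texts):
    texts = list(texts)
    num_cores = multiprocessing.cpu_count()
    chunk_size = max(1, -(-len(texts) // num_cores))  # ceiling division
    # pool.map calls the function once per element, so pass lists of texts, not single texts
    chunks = [texts[i:i + chunk_size] for i in range(0, len(texts), chunk_size)]
    with multiprocessing.Pool(processes=num_cores) as pool:
        embeddings = pool.map(generate_bert_embeddings, chunks)
    return np.concatenate(embeddings, axis=0)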

In [13]:
corpus_bert = generate_bert_embeddings_parallel(wine_df['description'][:10])

Summary

So far, in this exercise, you learned how to:

* Load a pre-trained transformer model (BERT) and a static embedding model (Word2Vec).
* Generate text embeddings using these models.
* Compute cosine similarity between embeddings.
* Compare the similarity results from both contextual and static embeddings.

Now you have a practical understanding of how transformers and embeddings can be used in Information Retrieval systems.

Let's integrate query search.

Step 7: Generate Embeddings for the Query

Generate embeddings for the query using the same model used for the documents.

In [13]:
query = 'computer science'
#query_word2vec_embeddings = generate_word2vec_embeddings([query])
# Wrap the query string in a list so that it is embedded as a single text
query_bert_embeddings = generate_bert_embeddings([query])

Step 8: Compute Similarity Between Query and Documents

Compute the similarity between the query embedding and each document embedding.

In [15]:
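from sklearn.metrics.pairwise import cosine_similarity

# cosine_similarity needs 2-D inputs; passing the 3-D BERT arrays directly raises
# "ValueError: Found array with dim 3. check_pairwise_arrays expected <= 2."
query_2d = query_bert_embeddings.reshape(len(query_bert_embeddings), -1)
docs_2d = bert_embeddings.reshape(len(bert_embeddings), -1)
# Cosine similarity between the query and the corpus embeddings, shape (1, n_documents)
cos_similarities = cosine_similarity(query_2d, docs_2d)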

Step 9: Retrieve and Rank Documents Based on Similarity Scores

Retrieve and rank the documents based on their similarity scores to the query.
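A minimal sketch of this step, assuming the similarity scores arrive as a `(1, n_documents)` array like `cos_similarities`. The scores and document labels below are made up for illustration, and `rank_documents` is a hypothetical helper name.

```python
import numpy as np

def rank_documents(similarities, documents, top_k=3):
    """Return (document, score) pairs sorted by descending similarity."""
    scores = np.asarray(similarities).ravel()   # accepts shape (1, n) or (n,)
    order = np.argsort(scores)[::-1][:top_k]    # indices of the highest scores first
    return [(documents[i], float(scores[i])) for i in order]

# Hypothetical scores and documents standing in for cos_similarities and the reviews
toy_scores = np.array([[0.12, 0.87, 0.45, 0.60]])
toy_docs = ["doc A", "doc B", "doc C", "doc D"]

ranking = rank_documents(toy_scores, toy_docs, top_k=3)
for doc, score in ranking:
    print(f"{score:.2f}  {doc}")
```

With the wine data, `documents` could be the list of review descriptions that were embedded, in the same order as the rows of the document embedding array.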